Re: 32GB limit per swap device?

2011-08-23 Thread Matthew Dillon
Two additional pieces of information.

The original limitation was more related to DEV_BSIZE calculations for
the buf/bio, which is now 64-bits and thus not applicable, though you
probably need some preemptive casts to ensure the multiplication is
done in 64-bits.  There was also another intermediate calculation
overflow in the swap radix-tree code which had to be fixed to be able
to use the full range... I think w/simple casts.  I haven't looked it
up but there should be a few commits in the DFly codebase that can
be referenced.
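
A minimal illustration of the kind of intermediate overflow and cast
involved (hypothetical function name, not the actual DFly commit):

    static off_t
    swap_block_to_byte_offset(int32_t blkno)
    {
            /*
             * BAD: the multiply is performed in 32 bits and wraps
             * once blkno * PAGE_SIZE exceeds 2^31 bytes:
             *
             *     return (blkno * PAGE_SIZE);
             *
             * OK: promote first so the multiply is done in 64 bits.
             */
            return ((off_t)blkno * PAGE_SIZE);
    }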

Second item:  The main physical memory use is not the radix tree bitmap
code for the swap code, but instead the auxiliary data structure used
to store the swapblk information which is associated with the vm_object
structure.  This structure contains a short array of swap block
assignments (as a memory optimization to reduce header overhead) and
it is these fields which you really want to keep 32-bits (unless you
want the ~1MB per ~1GB of swap to become ~2MB per ~1GB of swap in
physical memory overhead).  The block number is in page-sized chunks
so the practical limit is still ~4TB, with a further caveat below.

The further caveat is that the actual limitation for the radix tree
is 0x40000000 blocks, which is 1/4 the full range or ~1TB, so the
actual limitation for the (fixed) original radix tree code is ~1TB
rather than ~4TB.  This restricted range is due to some shift  
operators used in the radix tree code that I didn't want to make more
complicated.

So, my recommendation is to fix the intermediate calculations and keep
the swapblk related blockno fields 32 bits.

The preallocation for the vm_object's auxiliary structure must be large
enough to actually be able to fill up swap and assign all the swap blocks.
This is what eats the physical memory (4 bytes per 4K page, a 1:1024
storage factor).  The radix tree bitmap itself winds up eating only around
2 bits per swap block in total overhead.  So the auxiliary structure is
the main culprit.  You definitely want to keep those block number fields
in the aux structure 32 bits.
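
To put numbers on that (the same arithmetic as the next paragraph):

    4 bytes of block number per 4096-byte page  =  1/1024 overhead
    ~1TB of swap / 1024  =  ~1GB wired memory (32-bit block numbers)
    ~1TB of swap /  512  =  ~2GB wired memory (64-bit block numbers)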

The practical limit of ~1TB of swap requires ~1GB of preallocated
physical memory with a 32 bit block number field.  That would become
~2GB of preallocated memory if 64 bit block numbers were used instead,
for no gain other than wasting physical memory.  Ok, nobody is likely
to actually need that much swap, but people might be surprised: there
are a lot of modern-day uses for swap space that don't involve heavy
paging of anonymous memory.

-Matt
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: 32GB limit per swap device?

2011-08-22 Thread Matthew Dillon
The limitation was ONLY due to a *minor* 32-bit integer overflow in one
or two *intermediate* calculations in the radix tree code, which I
long ago fixed in DragonFly.

Just find the changes in the DFly codebase and determine if they need
to be applied.

The swap space radix code (which I wrote long ago) is in page-sized
blocks, so you actually probably want to keep using a 32-bit integer for
the block number there to keep the physical memory reservation required
for the radix tree low.  If you just pop the base block id up to 64 bits
without adjusting the radix code to overlay a 64 bit bitmap on it you
waste a lot of physical memory for the same amount of swap reservation.
This is NOT where the limitation lies.  It was strictly an intermediate
calculation that caused the original limitation.

With 32 bit block numbers stored in the radix tree nodes in the swap
code the physical limitation is something like 1 to 4 TB of total swap.
I forget exactly but it is at least 1TB.  I've tested 1TB swap partitions
on DragonFly with just the minor fixes to the original radix tree code.

--

Also note that I believe FreeBSD has done away with the interleaved swap.
I'm not sure why; I guess geom can interleave the swap for you, but I've
always thought that it would be easier to just specify and add the
partitions separately so one has the flexibility to swapon and swapoff
the individual partitions on a live system.  Interleaving is important
because you get an almost perfect performance multiplier.  You don't
want to just append the swap partitions one after another.

--

One last thing:  The amount of wired physical memory required is still
on the order of ~1MB per ~1GB of swap.  A 32-bit kernel is thus still
limited by available KVM, effectively limiting you to around ~32G of
swap depending on various factors if you do not want to run the system
out of KVM.  I've run upwards of 128G of swap on 32-bit systems but it
really pushed the KVM use and I would not recommend it.

A 64-bit kernel is *NOT* limited by KVM.  Swap is effectively limited to
~1TB or ~2TB using the original radix code with the one or two intermediate
overflow fixes applied.  The daddr_t in the original radix code can remain
32-bits (in DragonFly I typedef'd another name so I could explicitly make
it 32-bits regardless of daddr_t).
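
The shape of that typedef, roughly (the exact DragonFly name may
differ; swblk_t already exists as a 32-bit type in FreeBSD's
sys/types.h):

    typedef int32_t     swblk_t;    /* swap block number, always 32 bits */
    typedef u_int32_t   u_swblk_t;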

Large amounts of swap space are becoming important as things like tmpfs
(and swapcache in DragonFly as well) can really make use of it.  Swap
performance (the ability to interleave the swap space) is also important
for the same reason.  Interleaved swap on two SATA-III SSDs is just
insane... gives you something like 800MB/sec of aggregate read bandwidth.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: PCIe SATA HBA for ZFS on -STABLE

2011-06-06 Thread Matthew Dillon
:I'm not on the -STABLE list so please reply to me.
:
:I'm using an Intel Core i3-530 on a Gigabyte H55M-D2H motherboard with 8 x
:2TB drives & 2 x 1TB drives.
:The plan is to have the 1 TB drives in a zmirror and the 8 in a raidz2.
:
:Now the Intel chipset has only 6 on board SATA II ports so ideally I'm
:looking for a non RAID SATA II HBA to give me 6 extra ports (4 min).
:Why 6 extra ?
:Well the case I'm using has 2 x eSATA ports so 6 would be ideal, 5 OK, and 4
:the minimum I need to do the job.
:
:So...
:
:What do people recommend for 8-STABLE as a PCIe SATA II HBA for someone
:using ZFS ?
:
:Not wanting to break the bank.
:Not interested in SATA III 6GB at this time... though it could be useful if
:I add an SSD for... (is it ZIL ?).
:Can this be added at any time ?
:
:The main issue is I need at least 10 ports total for all existing drives...
:ZIL would require 11 so ideally we are talking a 6 port HBA.

The absolute cheapest solution is to buy a Sil-3132 PCIe card
(providing 2 E-SATA ports), and then connect an external port multiplier
to each port.  External port multiplier enclosures typically support
5 drives each so that would give you your 10 drives.

Even though the 3132 is a piss-ant little card it does support FIS-Based
switching, so performance will be very good... it will just be limited
to SATA-II speeds is all.

Motherboard AHCI-based SATA ports typically do NOT have FIS-Based
switching support (this would be the FBSS capability flag when the AHCI
driver probes the chipset).  This means that while you can attach an
external port multiplier enclosure to mobo SATA ports (see later
on E-SATA vs SATA), read performance from multiple drives concurrently
will be horrible.  Write performance will still be decent due to drive
write caches despite being serialized.
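
For reference, the FBSS check amounts to testing bit 16 of the HBA
capabilities (CAP) register at the base of ABAR space (sketch only,
not the actual driver code):

    #include <stdint.h>

    #define AHCI_REG_CAP    0x00            /* HBA capabilities */
    #define AHCI_CAP_FBSS   (1u << 16)      /* FIS-based switching */

    static int
    ahci_supports_fbss(volatile uint32_t *abar)
    {
            return ((abar[AHCI_REG_CAP / 4] & AHCI_CAP_FBSS) != 0);
    }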

On E-SATA vs SATA.  Essentially there are only two differences between
E-SATA and SATA.  One is the cable and connector format.  The other is
hot swap detection.  Most mobo SATA ports can be strung out to E-SATA
with an appropriate adapter.  High-end Intel ASUS mobos often come with
such adapters (this is why they usually don't sport an actual E-Sata
port on the backplane) and the BIOS has setup features to specify E-SATA
on a port-by-port basis.

--

For SSDs you want to directly connect the SSD to a mobo SATA port and
then either mount the SSD in the case or mount it in a hot-swap gadget
that you can screw into a PCI slot (it doesn't actually use the PCI
connector, just the slot).  A SATA-III port with a SATA-III SSD really
shines here and 400-500 MBytes/sec random read performance from a single
SSD is possible, but it isn't an absolute requirement.  A SATA-II port
will still work fine as long as you don't mind maxing out the bandwidth
at 250 MBytes/sec.

--

I can't help with any of the other questions.  Someone also suggested
the MPS driver for FreeBSD, with caveats.

I'll add a caveat on the port multiplier enclosures.  Nearly all such
enclosures use another SIL chipset internally and it works pretty well
EXCEPT that it isn't 100% dependable if you try to hot-swap drives in the
enclosure while other drives in the enclosure are active.  So with that
caveat, I recommend the port multiplier enclosure as the cheapest solution.

To get robust hot-swap enclosures you either need to go with SAS or you
need to go with discrete SATA ports (no port multiplication), and the
ports have to support hot-swap.  The best hot-swap support for an AHCI
port is if the AHCI chipset supports cold-presence-detect (CPD), and
again Mobo AHCI chipsets usually don't.  Hot-swap is a bit hit or miss
without CPD because power savings modes can effectively prevent hot-swap
detect from working properly.  Drive disconnects will always be detected
but drive connects might not be.

And even with discrete SATA ports the AHCI firmware on mobos does not
necessarily handle hot-swap properly.  For example my Intel-I7 ASUS mobo
will generate spurious interrupts and status on a DIFFERENT discrete
SATA port when I hot swap on some other discrete SATA port, in addition
to generating the status interrupt on the correct port.  So then it comes
down to the driver in the operating system properly handling the
spurious status and properly stopping and restarting pending commands
when necessary.   So, again, it is best for the machine to be idle before
attempting a hot-swap.

Lots of caveats.  Sorry... you can blame Intel for all the blasted issues
with AHCI and SATA.  Intel didn't produce a very good chipset spec and
vendors took all sorts of liberties.

-Matt
Matthew Dillon 
dil...@backplane.com

Re: Constant rebooting after power loss

2011-04-03 Thread Matthew Dillon
: Do you know if that's changed at all with NCQ on modern SATA drives? 
: I've seen people commenting that using tags recovers most, if not all, 
: of the performance lost by disabling the write cache.
:...

I've never tried that combination.  Theoretically the 32 tags SATA
supports would just barely be enough for sequential write service
loads but I really doubt it would be enough for mixed service loads
and you would be blowing up your read performance to achieve even
that due to the length of time the tags stay busy with writes.

With some driver massaging, such as partitioning the tag space
and dedicating a specific number of tags for writing, read performance
could probably be maintained but write performance (with caches off)
would definitely still suffer.  It might not be horrible, though.
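
In rough code the partitioning idea looks like this (the split and the
names are made up):

    #define NCQ_TAGS        32
    #define WRITE_TAGS      8       /* tags reserved for writes */

    /* returns 0 if a tag is available, -1 if the caller must defer */
    static int
    tag_admit(int is_write, int write_inflight, int read_inflight)
    {
            if (is_write)
                    return (write_inflight < WRITE_TAGS ? 0 : -1);
            return (read_inflight < NCQ_TAGS - WRITE_TAGS ? 0 : -1);
    }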

One advantage of turning off the drive's write cache is that it
would be possible for the OS to control write interference vs read
loads, which is impossible to do with caches turned on.  That is,
with caches turned on your writes are instantly acknowledged until
the drive's own caches exceed their dirty limits and by that time
the drive is juggling so much dirty data that we (the OS/driver)
have no control over read vs write performance.  This is why it is
so blasted difficult to write I/O schedulers in OS's that actually
work.

With caches disabled the OS/driver would have a great deal more control
over read vs write performance.  I/O scheduling would become viable.
But to really make it work well I think we would need 64-128 tags
(or more) to be able to cover multiple writing zones.  With only 32 tags
the drive's zone cache will be defeated.

It would be a very interesting test.  I can't immediately dismiss tagged
I/O with write caches disabled.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Constant rebooting after power loss

2011-04-01 Thread Matthew Dillon
The core of the issue here comes down to two things:

First, a power loss to the drive will cause the drive's dirty write cache
to be lost, that data will not make it to disk.  Nor do you really want
to turn off write caching on the physical drive.  Well, you CAN turn it
off, but if you do performance will become so bad that there's no point.
So turning off the write caching is really a non-starter.

The solution to this first item is for the OS/filesystem to issue a
disk flush command to the drive at appropriate times.  If I recall the
ZFS implementation in FreeBSD *DOES* do this for transaction groups,
which guarantees that a prior transaction group is fully synced before
a new one starts running (HAMMER in DragonFly also does this).
(Just getting an 'ack' from the write transaction over the SATA bus only
means the data made it to the drive's cache, not that it made it to
the platter).

I'm not sure about UFS vis-à-vis the recent UFS logging features...
it might be an option but I don't know if it is a default.  Perhaps
someone can comment on that.

One last note here.  Many modern drives have very large ram caches.
OCZ's SSDs have something like 256MB write caches and many modern HDs
now come with 32MB and 64MB caches.  Aged drives with lots of relocated
sectors and bit errors can also take a very long time to perform writes
on certain sectors.  So these large caches take time to drain and one
can't really assume that an acknowledged write to disk will actually
make it to the disk under adverse circumstances any more.  All sorts
of bad things can happen.

Finally, the drives don't order their writes to the platter (you can
set a bit to tell them to, but like many similar bits in the past there
is no real guarantee that the drives will honor it).  So if two
transactions do not have a disk flush command in between them it is
possible for data from the second transaction to commit to the platter
before all the data from the first transaction commits to the platter.
Or worse, for the non-transactional data to update out of order relative
to the transactional data which was supposed to commit first.

Hence IMHO the OS/filesystem must use the disk flush command in such
situations for good reliability.
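
In FreeBSD GEOM terms the barrier boils down to something like the
following sketch (wait_for_group_writes() is hypothetical;
g_io_flush() is the real primitive):

    static int
    commit_transaction_group(struct g_consumer *cp)
    {
            /*
             * Hypothetical: block until every write BIO in the
             * current transaction group has been acknowledged.
             */
            wait_for_group_writes();

            /*
             * BIO_FLUSH round-trip: on return the acknowledged
             * writes are on the platter, so the next group may
             * safely begin issuing its writes.
             */
            return (g_io_flush(cp));
    }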

--

The second problem is that a physical loss of power to the drive can
cause the drive to physically lose one or more sectors, and can even
effectively destroy the drive (even with the fancy auto-park)... if the
drive happens to be in the middle of a track write-back when power is
lost it is possible to lose far more than a single sector, including
sectors unrelated to recent filesystem operations.

The only solution to #2 is to make sure your machines (or at least the
drives if they happen to be in external enclosures) are connected to
a UPS and that the machines are communicating with the UPS via
something like the apcupsd port.  AND also that you test to make
sure the machines properly shut themselves down when AC is lost before
the UPS itself runs out of battery time.  After all, a UPS won't help
if the machines don't at least idle their drives before power is lost!!!

I learned this lesson the hard way about 3 years ago.  I had something
like a dozen drives in two raid arrays doing heavy write activity and
lost physical power and several of the drives were totally destroyed,
with thousands of sector errors.  Not just one or two... thousands.

(It is unclear how SSDs react to physical loss of power during heavy
writing activity.  Theoretically while they will certainly lose their
write cache they shouldn't wind up with any read errors).

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: How to bind a static ether address to bridge?

2011-02-25 Thread Matthew Dillon
If you can swing a routed network that will definitely have the fewest
complications.

For a switched network if_bridge and ARP have to be integrated, something
I just finished doing in DragonFly, so that all member interfaces of the
bridge use *only* the bridge's MAC for all transactions, including ARP
transactions, whether they require forwarding through the bridge or not.

The bridge has its own internal forwarding table and a great deal of
confusion occurs if the normal ARP code is trying to tie into individual
interfaces instead of just the bridge interface, for *ANY* member of
the bridge, not just the first member of the bridge.

Some of the problems you are likely to hit using if_bridge:

* ARP response flows in on member interface A with an ether destination
  of member interface B.  OS decides to record the ARP route as coming
  from interface B (when it's actually coming from interface A),
  while the bridge internally records the proper forwarding (A).
  Fireworks ensue.

* ARP responses targeting member interfaces which are part of the
  spanning tree protocol (when you have redundant links) and which
  then wind up placed in the blocking state by the spanning tree
  protocol.

  The if_bridge code in FreeBSD sets the bridge's MAC to be the
  same as the first added interface, which is usually your LAN
  ethernet port.  This will help a bit, just make sure that it *IS*
  your LAN ethernet port and that the spanning tree protocol is *NOT*
  turned on for that port.

  However, other member interfaces (usually TAPs if you are using
  something like OpenVPN) will have different MAC addresses and that
  will cause confusion.

  It might be possible to work around both issues by setting the MAC for
  *ALL* member interfaces to be the same as the bridge MAC, but I don't
  know.  I gave up trying to do that in DFly and instead modified the ARP
  code to always use the bridge MAC for any interface which is a member of
  a bridge.  That appears to have worked quite well.

  My home network (using DragonFly) is using if_bridge to a colocated box,
  ether bridging a class C over three WANs via OpenVPN, with the related
  TAP interfaces and the LAN interface as members of the bridge.  The
  bridge is set up with the spanning tree protocol turned on for the three
  TAP interfaces and with bonding turned on for two of the TAP interfaces.
  But that's with DFly (and I just finished the work two days ago).
  If something similar cannot be done w/FreeBSD then I recommend porting
  the changes from DFly over to FreeBSD's bridging and ARP modules.

  It was a big headache but once I cleared up the ARP confusion things just
  started magically working.

  Other caveats:

* TAP and BRIDGE interfaces are assigned a nearly random MAC address
  when they are created (in FreeBSD the bridge sets its MAC to the
  first member interface so that is at least ok if you always add your
  LAN as the first member interface, however the other member interfaces
  aren't so lucky).  Rebooting the machine containing the bridge or
  destroying and rebuilding the bridge can create total and absolute
  havoc on your network because the rest of your switching
  infrastructure and machines will have the old MACs cached.

  The partial solution is taking on the MAC address of the LAN interface,
  which FreeBSD's bridging code does, and it might be possible to also
  set the other member interfaces to that same MAC (but I don't know if
  that will work).  If not then this is almost a non-solvable problem
  short of making the ARP module more aware of the bridge.

* If using redundant links without bonding support in the bridge code
  the bridge itself will get confused when the topology changes, though
  if it is a simple topology the bridge should be able to start forwarding
  to the backup link even though its internal forwarding table is messed
  up.

  The concept of a 'backup' link is a bit of a hack in the STP code
  (just as the concept of 'bonding' is a bit of a hack), so how well it
  works will depend on a lot of different factors.  The idea of a
  'backup' link is to be able to continue to switch packets when only
  one path is available even if that path has not been completely
  resolved through the STP protocol.

* ARP only works because *EVERYONE* uses the same timeout.  Futzing
  around with member associations on the bridge will cause the bridge
  to forget.  The bridge should theoretically broadcast unicast packets
  for which it doesn't have a forwarding entry but... well, it is still
  possible for machines to get confused.

  When working on your setup you may have to 'arp -d -a' on one or
  more machines multiple times to force them to re-arp and cause all
  your intermediate ethernet switches to 

Re: vm.swap_reserved toooooo large?

2010-12-20 Thread Matthew Dillon
One of the problems with resource management in general is
that it has traditionally been per-process, and due to the
multiplicative effect (e.g. max-descriptors * limit-per-descriptor),
per-process resources cannot be set such that any given user is
prevented from DDOSing the system without making them so low that
normal programs begin to fail for no good reason.

Hence the advent of per-user and other more suitable resource
limits, nominally set via sysctl.  Even with these, however,
it is virtually impossible to protect against a user DDOS.
The kernel itself has resource limitations which are fairly easy
to blow out... mbufs are usually the easiest to blow up, followed
by pipe KVM memory.  Filesystems can be blown up too by creating
sparse files and mmap()ing them (thus circumventing normal overcommit
limitations).
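
The sparse-file trick is easy to reproduce (userland sketch, error
handling omitted):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            off_t sz = 64LL * 1024 * 1024 * 1024;   /* 64G of hole */
            int fd = open("/tmp/sparse", O_RDWR | O_CREAT | O_TRUNC, 0600);
            char *p;

            ftruncate(fd, sz);              /* no blocks allocated yet */
            p = mmap(NULL, (size_t)sz, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            /*
             * Dirtying pages of p now allocates filesystem blocks on
             * demand, past whatever the overcommit accounting saw.
             */
            return (p == MAP_FAILED);
    }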

Paging just itself, without running the system out of VM, can destroy
a machine's performance and be just as effective a DDOS attack as
resource starvation is.

Virtual memory resources are similarly impacted.  Overcommit limiting
features have as many downsides as they have upsides.  It's an endless
argument but I've seen systems blow up with overcommit limits set even
more readily than with no (overcommit) limits set.  Theoretically
overcommit limits make the system more manageable but in actual practice
they only work when the application base is written with such limits
in mind (and most are not).  So for a general purpose unix environment
putting limits on overcommit tends to create headaches.  To be sure, in
a turn-key environment overcommit serves a very important function.  In
a non-turn-key environment however it will likely create more problems
than it will solve.

The only way to realistically deal with the mess, if it is important
to you, is to partition the systems' real resources and run stuff
inside their own virtualized kernels each of which does its own
independent resource management and whose I/O on the real system can
be well-controlled as an aggregate.

Alternatively, creating very large swap partitions works very well to
mitigate the more common problems.  Swap itself is changing its function.
Swap is no longer just used for real memory overcommit (in fact,
real memory overcommit is quite uncommon these days).  It is now also
used for things like tmpfs, temporary virtual disks, meta-data
caching, and so forth.  These days the minimum amount of swap I
configure is 32G and as efficient swap storage gets more cost effective
(e.g. SSDs), significantly more.  70G, 110G, etc.

It becomes more a matter of being able to detect and act on the
DDOS/resource issue BEFORE it gets to the point of killing important
processes (definition: whatever is important for the functioning of
that particular machine, user-run or root-run), and less a matter of
hoping the system will do the right thing when the resource limit is
actually reached.  Having a lot of swap gives you more time to act.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Make ZFS auto-destroy snapshots when the out of space?

2010-05-30 Thread Matthew Dillon
It is actually a security issue to automatically destroy snapshots based
on whether a filesystem is full, even automatically generated snapshots.
Since one usually implements snapshots to perform a function you wish
to rely on, such as to retain backups of historical data for auditing
or other purposes, you do not want an attacker to be able to indirectly
destroy snapshots simply by filling up the filesystem.

Instead what you want to do is to treat both the automatic and the manual
snapshots as an integrated part of the filesystem's operation.  Just as
we have to deal with a nominal non-snapshotted filesystem-full condition
today we also want to treat a filesystem with multiple snapshots in the
same vein.  So, for example, you might administratively desire 60 1-day
snapshots plus 10-minute snapshots for the most recent 3 days to be
retained at all times.  The automatic maintenance of the snapshots
would then administratively delete snapshots over 60 days old and prune
to a coarser grain past 3 days.

The use of snapshots on a modern filesystem capable of managing large
numbers of snapshots relatively pain-free, particularly on large storage
systems and/or modern multi-terabyte HDs, requires a bit of a change
in thinking.  You have to stop thinking of the snapshots as optional and
start thinking of them as mandatory.

When snapshot availability is an assumed condition and not an
exceptional or special-case condition it opens up a whole new arena
in how filesystems can be managed, backed-up, audited, and used in
every-day work.  Once your thinking processes change you'll never
go back to non-snapshotted or nontrivially-snapshotted filesystems.

And you will certainly not want to allow a filesystem being mistakenly
filled up to destroy your precious snapshots :-)

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: hardware for home use large storage

2010-02-10 Thread Matthew Dillon
:Correction -- more than likely on a consumer motherboard you *will not*
:be able to put a non-VGA card into the PCIe x16 slot.  I have numerous
:Asus and Gigabyte motherboards which only accept graphics cards in their
:PCIe x16 slots; this feature is documented in user manuals.  I
:don't know how/why these companies chose to do this, but whatever.
:
:I would strongly advocate that the OP (who has stated he's focusing on
:stability and reliability over speed) purchase a server motherboard that
:has a PCIe x8 slot on it and/or server chassis (usually best to buy both
:of these things from the same vendor) and be done with it.
:
:-- 
:| Jeremy Chadwick   j...@parodius.com |

It is possible this is related to the way Intel on-board graphics
work in recent chipsets.  e.g. i915 or i925 chipsets.  The
on-motherboard video uses a 16-lane internal PCI-e connection which
is SHARED with the 16-lane PCI-e slot.  If you plug something into
the slot (e.g. a graphics card), it disables the on-motherboard
video.  I'm not sure if the BIOS can still boot if you plug something
other than a video card into these MBs and no video at all is available.
Presumably it should be able to, you just wouldn't have any video at
all.

Insofar as I know AMD-based MBs with on-board video don't have this
issue, though it should also be noted that AMD-based MBs tend to be
about 6-8 months behind Intel ones in terms of features.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: hardware for home use large storage

2010-02-09 Thread Matthew Dillon
The Silicon Image 3124A chipsets (the PCI-e version of the 3124.  The
original 3124 was PCI-x).  The 3124A's are starting to make their way
into distribution channels.  This is probably the best 'cheap' solution
which offers fully concurrent multi-target NCQ operation through a port
multiplier enclosure with more than the PCIe 1x bus the ultra-cheap
3132 offers.  I think the 3124A uses an 8x bus (not quite sure, but it
is more than 1x).

AHCI on-motherboard with equivalent capabilities do not appear to be
in wide distribution yet.  Most AHCI chips can do NCQ to a single
target (even a single target behind a PM), but not concurrently to
multiple targets behind a port multiplier.  Even though SATA bandwidth
constraints might seem to make this a reasonable alternative it
actually isn't because any seek heavy activity to multiple drives
will be serialized and perform EXTREMELY poorly.  Linear performance
will be fine.  Random performance will be horrible.

It should be noted that while hotswap is supported with silicon image
chipsets and port multiplier enclosures (which also use Sili chips in
the enclosure), the hot-swap capability is not anywhere near as robust
as you would find with a more costly commercial SAS setup.  SI chips
are very poorly made (this is the same company that went bust under
another name a few years back due to shoddy chipsets), and have a lot
of on-chip hardware bugs, but fortunately OSS driver writers (linux
guys) have been able to work around most of them.  So even though the
chipset is a bit shoddy actual operation is quite good.  However,
this does mean you generally want to idle all activity on the enclosure
to safely hot swap anything, not just the drive you are pulling out.
I've done a lot of testing and hot-swapping an idle disk while other
drives in the same enclosure are hot is not reliable (for a cheap port
multiplier enclosure using a Sili chip inside, which nearly all do).

Also, a disk failure within the enclosure can create major command
sequencing issues for other targets in the enclosure because error
processing has to be serialized.  Fine for home use but don't expect
miracles if you have a drive failure.

The Sili chips and port multiplier enclosures are definitely the
cheapest multi-disk solution.  You lose on aggregate bandwidth and
you lose on some robustness but you get the hot-swap basically for free.

--

Multi-HD setups for home use are usually a lose.  I've found over
the years that it is better to just buy a big whopping drive and
then another one or two for backups and not try to gang them together
in a RAID.  And yes, at one time in the past I was running three
separate RAID-5 using 3ware controllers.  I don't anymore and I'm
a lot happier.

If you have more than 2TB worth of critical data you don't have much
of a choice, but I'd go with as few physical drives as possible
regardless.  The 2TB Maxtor green or black drives are nice.  I
strongly recommend getting the highest-capacity drives you can
afford if you don't want your power bill to blow out your budget.

The bigger problem is always having an independent backup of the data.
Depending on a single-instanced filesystem, even one like ZFS, for a
lifetime's worth of data is not a good idea.  Fire, theft... there are
a lot of ways the data can be lost.  So when designing the main
system you have to take care to also design the backup regimen
including something off-site (or swapping the physical drive once
a month, etc). i.e. multiple backup regimens.

If single-drive throughput is an issue then using ZFS's caching
solution with a small SSD is the way to go (and yes, DFly has a SSD
caching solution now too but that's not pertinent to this thread).
The Intel SSDs are really nice, but I am singularly unimpressed with
the OCZ Colossus's which don't even negotiate NCQ.  I don't know much
re: other vendors.

A little $100 Intel 40G SSD has around a 40TB write endurance and can
last 10 years as a disk meta-data caching environment with a little care,
particularly if you only cache meta-data.  A very small incremental
cost gives you 120-200MB/sec of seek-agnostic bandwidth which is
perfect for network serving, backup, remote filesystems, etc.  Unless
the box has 10GigE or multiple 1xGigE network links there's no real
need to try to push HD throughput beyond what the network can do
so it really comes down to avoiding thrashing the HDs with random seeks.
That is what the small SSD cache gives you.  It can be like night and
day.
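
The endurance arithmetic is simple:

    40TB write endurance / (10 years x 365 days)  ~=  11GB of writes/day

which is easy to stay under when the SSD is only caching meta-data.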

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

Re: immense delayed write to file system (ZFS and UFS2), performance issues

2010-01-26 Thread Matthew Dillon
:I'm experiencing the same thing, except in my case it's most noticeable 
:when writing to a USB flash drive with a FAT32 filesystem.  It slows the 
:entire system down, even if the data being written is coming from cache 
:or a memory file system.
:
:I don't know if it's related.  I'm running 8-STABLE from about 4 December.
:
:Regards,
:Aragon

I don't know re: the main thread but in regards to writing to a USB
flash drive interfering with other operations the most likely cause
is that the buffer cache fills up with dirty buffers destined for the
(slow) USB drive.  This causes other unrelated drive subsystems
to block on the buffer cache.

There are no easy answers.  A poor-man's solution would be to limit
dirty buffers in the buffer cache to 80% of the nominal dirty maximum
on a per-mount basis so no single mount can kill the buffer cache.
(One can't just cut-up the buffer cache as that would leave too few
buffers available for each mount to operate efficiently).  A per-mount
minimum buffer guarantee would also help smooth things out but the
value would have to be small (comprise no more than 20% of the buffer
cache in aggregate).
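
Expressed as code the 80% check is trivial (names are made up):

    /* global dirty-buffer budget, in buffers */
    extern long dirty_max;

    static int
    mount_may_dirty(long mount_dirty_bufs)
    {
            /* one mount may own at most 80% of the dirty budget */
            return (mount_dirty_bufs < (dirty_max * 8) / 10);
    }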

In the case of UFS the write-behind code is asynchronous, so even
though UFS wants to flush the buffers out all that happens in reality
when writing to slow media is that the dirty buffers wind up on
the I/O queue (which is actually worse than leaving them B_DELWRI in
the buffer cache because now the VM pages are all soft-busied).

-Matt
Matthew Dillon 
dil...@backplane.com
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: immense delayed write to file system (ZFS and UFS2), performance issues

2010-01-26 Thread Matthew Dillon
Here's what I got from one of my 2TB WD green drives.  This one
is Firmware 01.00A01.  Load_Cycle_Count is 26... seems under
control.

It gets hit with a lot of activity separated by a lot of time
(several minutes to several hours), depending on what is going on.
The box is used for filesystem testing.  Regardless it seems to
stay spun-up all the time, or nearly all the time.

Neither the BIOS nor the kernel driver is messing with the SUD
control on the Silicon Image board it is connected to (other
than just turning it on and leaving it that way).  If the
drive has an intelligent parking function it doesn't seem to
be using it much.  I haven't specifically disabled any such
function.

Device Model: WDC WD20EADS-00R6B0
Serial Number:WD-WCAVY0259672
Firmware Version: 01.00A01
User Capacity:2,000,398,934,016 bytes
Device is:Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:Tue Jan 26 19:25:48 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   212   150   021    Pre-fail  Always       -       6375
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       39
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4252
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   121   111   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

I have a few of these babies strewn around.  The others show about
the same stats, e.g. this one is used in a production box.  Same
drive type, same firmware:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4164
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       26
...

So on the face of it things seem ok with these drives.  Presumably WD
is working adjustments into the firmware as time goes on.  Hopefully
they aren't just masking the count in the SMART page to appease
techies :-)

These particular WDs (2TB Caviar Green's) are slow drives.  5600 rpm,
100MB/sec.  But they are also very quiet in operation and seem to
be quite power efficient.

-Matt
Matthew Dillon 
dil...@backplane.com
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: An old gripe: Reading via mmap stinks

2010-01-14 Thread Matthew Dillon
:   mmap:   43.400u 9.439s 2:35.19 34.0%16+184k 0+0io 106994pf+0w
:   read: 41.358u 23.799s 2:12.04 49.3%   16+177k 67677+0io 0pf+0w
:
:Observe, that even though read-ing is quite taxing on the kernel (high 
:sys-time), the mmap-ing loses overall -- at least, on an otherwise idle 
:system -- because read gets the full throughput of the drive (systat -vm 
:shows 100% disk utilization), while pagefaulting gets only about 69%.
:
:When I last brought this up in 2006, it was revealed, that read(2) 
:uses heuristics to perform a read-ahead. Why can't the pagefaulting-in 
:implementation use the same or similar trickery was never explained...

Well, the VM system does do read-ahead, but clearly the pipelining
is not working properly because if it were then either the cpu or
the disk would be pegged, and neither is.

It's broken in DFly too.  Both FreeBSD and DragonFly use
vnode_pager_generic_getpages() (UFS's ffs_getpages() just calls
the generic) which means (typically) the whole thing devolves into
a UIO_NOCOPY VOP_READ().  The VOP_READ should be doing read-ahead
based on the sequential access heuristic but I already see issues
in both implementations of vnode_pager_generic_getpages() where it
finds a valid page from an earlier read-ahead and stops (failing to
issue any new read-aheads because it fails to issue a new UIO_NOCOPY
VOP_READ... doh!).

This would explain why the performance is not as bad as linux but
is not as good as a properly pipelined case.  I'll play with it
some in DFly and I'm sure the FreeBSD folks can fix it in FreeBSD.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: incorrect usleep/select delays with HZ 2500

2009-09-21 Thread Matthew Dillon
What we wound up doing was splitting tvtohz() into two functions.

tvtohz_high(tv)

Returned value meets or exceeds requested time.  A minimum value
of 1 is returned (really only for {0,0}.. else minimum value is 2).

tvtohz_low(tv)

Returned value might be shorter than requested time, and 0 can
be returned.

Most kernel functions use the tvtohz_high() function.  Only a few
use tvtohz_low().

I have not found any 'good' solution to the problem.  For example,
average-up errors can mount up when using the results to control a
callout timer, resulting in much longer delays than originally intended,
and similarly same-tick interrupts (e.g. a value of 1) can create
much shorter delays than expected.  Sometimes one cares more about
the average interval being correct; other times the time must not
be allowed to be too short.  You lose no matter what you choose.

http://fxr.watson.org/fxr/source/kern/kern_clock.c?v=DFBSD

If you look at tvtohz_high() you will note that the minimum value
of 1 is only returned if the passed tv is essentially {0,0}.  i.e. 0uS.
1uS == 2 ticks (((us + (tick - 1)) / tick) + 1).  The 'tick' global
here is the number of uS per tick (not to be confused with 'ticks').

Because of all of that I decided to split the function to make the
requirements more apparent.
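
The two functions reduce to roughly the following ('ustick' standing
in for the uS-per-tick global; the _low rounding is from memory and
may not match the actual code exactly):

    extern int ustick;      /* microseconds per hardclock tick */

    static int
    tvtohz_high(int us)     /* never shorter than requested */
    {
            if (us == 0)
                    return (1);
            return ((us + (ustick - 1)) / ustick + 1);
    }

    static int
    tvtohz_low(int us)      /* may be shorter; 0 is possible */
    {
            return (us / ustick);
    }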

--

The nanosleep() work is a different issue... that's for userland calls
(primarily the libc usleep() function).  We found that some linux
programs assumed that nanosleep() was far more fine-grained than (hz)
and, anyway, the system call is called 'nanosleep' and 'usleep' which
kind of implies a fine-grained sleep, so we turned it into one when
small time intervals were being requested.

http://fxr.watson.org/fxr/source/kern/kern_time.c?v=DFBSD

The way I figure it if a userland program wants to make system calls
with fine-grained sleeps that are too small, it's really no different
from treating that program as being cpu-bound anyway so why not try to
accommodate it?

--

The 8254 issue is more one of a lack of interest in fixing it.
Basically using the 8254 as a measure of realtime when the reload
value is set too small (i.e. high hz) will always lead to serious
timing problems.  The reason there is such a lack of interest
in fixing it is that most machines have other timers available
(lapic, acpi, hpet, tsc, etc).  A secondary issue might be tying
real-time functions to 'ticks', which could still be driven by the
8254 interrupt; those have to be divorced from ticks.  I'm not
sure if FreeBSD has any of those left (does date still skip quickly if
hz is set ultra-high?  Even when other timers are available?).

I will note that tying real-time functions to the hz-based tick
function (which is also the 8254-driven problem when other timers
are not available) leads to serious problems, particularly with ntpd,
even if you only lose track of the full cycle of the timer
occasionally.

However, neither do you want to 'skip' the ticks value to catch up
to a lost interrupt.  That will mess up tsleep() and other hz-based
timeouts that assume that values of '2' will not instantly
time out.

So actual realtime operations really do have to be completely divorced
from the hz-based ticks counter and it must only be used for looser
timing needs such as protocol timeouts and sleeps.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS root File System

2009-02-27 Thread Matthew Dillon
My experience with one of our people trying to do the same thing 
w/ HAMMER... we got it working, but it is not necessarily cleaner.

I'd rather just boot from a small UFS /boot partition on 'a' (256M
or 512M), followed by swap on 'b', followed by the big-ass root
partition on 'd' using your favorite filesystem.

The boot code already pretty much handles this state of affairs, one only
needs:

(1) To partition it this way.

(2) Add line to /boot/loader.conf pointing the kernel at the actual root,
e.g. (in my case):

vfs.root.mountfrom="hammer:ad6s1d"

(3) Adjust sysctl kern.bootfile in e.g. /etc/sysctl.conf.  Since the
boot loader thinks the kernel is on / instead of /boot (because
/boot is the root from the point of view of the bootloader),
it might set this to /kernel instead of /boot/kernel.  So
you may have to override it to make crash dumps and name lists
work properly.

(4) Add a mount for the little /boot partition in /etc/fstab.
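
For step (4) the fstab entry is a one-liner (device matching the df
output below):

    /dev/ad6s1a     /boot           ufs     rw      2       2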

Trying to create one large root on 'a' puts the default spot for swap
on 'b' at the end of the disk instead of near the beginning.  The end
of the disk (closer to the spindle) is a bad place for swap.  Having
a small /boot partition there instead retains the ordering and puts the
swap where it is expected to be.

# df
Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/ad6s1d  193888256   1662976 192225280     1%    /
/dev/ad6s1a     257998    110896    126464    47%    /boot

--

In any case, if RAID is an issue the loader could always be adjusted to
look for a boot partition on multiple disks.  One could then have a /boot
on two independent disks, or even operate it as a soft-raid-mirror.  It
seems less of an issue these days since someone with that sort of
requirement who isn't already net-booting can just pop in a SSD for
booting which will have approximately the same or better MTBF as the
motherboard electronics.

The problem we face with HAMMER is related to the boot loader not being
able to run the UNDO buffer (yet), so it might not be able to find
the kernel after a crash.  That and the inconvenient place swap ends up
at.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: sidetrack [was Re: 'at now' not working as expected]

2008-10-09 Thread Matthew Dillon
Also, if you happen to have a handheld GPS unit, it almost certainly
has a menu option to tell you the sunrise and sunset times at your
current position.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

2008-09-30 Thread Matthew Dillon
 background fsck problematic
and risky.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Would anybody port DragonFlyBSD's HAMMER fs to FreeBSD?

2008-09-30 Thread Matthew Dillon
Guys, please don't start a flamewar.  And lhmwzy, we discussed this
on the DFly lists.  It's really up to them... that is, a programmer
who has an interest, inclination, and time.  It isn't really fair to
try to push it.

I personally believe that the FreeBSD community as a whole should
focus on ZFS for now.  It has the momentum and the most interest
on their lists.

-Matt
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

2008-09-29 Thread Matthew Dillon
A couple of things to note here.  Well, many things actually.

* Turning off write caching, assuming the drive even looks at the bit,
  will destroy write performance for any driver which does not support
  command queueing.  So, for example, scsi typically has command
  queueing (as long as the underlying drive firmware actually implements
  it properly), 3Ware cards have it (underlying drives, if SATA, may not,
  but 3Ware's firmware itself might do the right thing). 

  The FreeBSD ATA driver does not, not even in AHCI mode.  The RAID
  code does not as far as I can tell.  You don't want to turn this off.

* Filesystems like ZFS and HAMMER make no assumptions on write
  ordering to disk for completed write I/O vs future write I/O
  and use BIO_FLUSH to enforce ordering on-disk.  These filesystems
  are able to queue up large numbers of parallel writes inbetween
  each BIO_FLUSH, so the flush operation has only a very small 
  effect on actual performance.

  Numerous Linux filesystems also use the flush command and do
  not make assumptions on BIO-completion/future-BIO ordering.

* UFS + softupdates assumes write ordering between completed BIO's
  and future BIOs.  This doesn't hold true on a modern drive (with
  write caching turned on).  Unfortunately it is ALSO not really
  the cause behind most of the inconsistency reports.

  UFS was *never* designed to deal with disk flushing.  Softupdates
  was never designed with a BIO_FLUSH command in mind.  They were
  designed for formally ordered I/O (bowrite) which fell out of
  favor about a decade ago and has since been removed from most 
  operating systems.

* Don't get stuck in a rut and blame DMA/Drive/firmware for all the
  troubles.  It just doesn't happen often enough to even come close
  to being responsible for the number of bug reports.

With some work UFS can be modified to do it, but performance will
probably degrade considerably because the only way to do it is to
hold the completed write BIOs (not biodone() them) until something
gets stuck, or enough build up, then issue a BIO_FLUSH and, after
it returns, finish completing the BIOs (call the biodone()) for the
prior write I/Os.  This will cause softupdates to work properly.
Softupdates orders I/O's based on BIO completions. 

Another option would be to complete the BIOs but do major surgery on
softupdates itself to mark the dependencies as waiting for a flush,
then flush proactively and re-sync.
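
The first option, as a sketch (the held-BIO queue is hypothetical;
g_io_flush(), bioq_takefirst(), and biodone() are the real primitives):

    static void
    flush_and_release(struct bio_queue_head *held, struct g_consumer *cp)
    {
            struct bio *bp;

            g_io_flush(cp);         /* held data is now on the media */
            while ((bp = bioq_takefirst(held)) != NULL)
                    biodone(bp);    /* softupdates dependencies advance */
    }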

Unfortunately, this will not solve the whole problem.  IF THE DRIVE
DOESN'T LOSE POWER IT WILL FLUSH THE BIOs IT SAID WERE COMPLETED.
In other words, unless you have an actual power failure the assumptions
softupdates makes will hold.  A kernel crash does NOT prevent the actual
drive from flushing the IOs in its cache.  The disk can wind up with
unexpected softupdates inconsistencies on reboot anyway.  Thus the
source of most of the inconsistency reports will not be fixed by adding
this feature.  So more work is needed on top of that.

--

Nearly ALL of the unexpected softupdates inconsistencies you see *ARE*
for the case where the drive DOES in fact get all the BIO data it
returned as completed onto the disk media.  This has happened to me
many, many times with UFS.  I'm repeating this:  Short of an actual
power failure, any I/O's sent to and acknowledged by the drive are
flushed to the media before the drive resets.  A FreeBSD crash does
not magically prevent the drive from flushing out its internal queues.

This means that there are bugs in softupdates & the kernel which can
result in unexpected inconsistencies on reboot.  Nobody has ever
life-tested softupdates to try to locate and fix the issues.  Though I
do occasionally see commits that try to fix various issues, they tend
to be more for live-side non-crash cases than for crash cases.

Some easy areas which can be worked on:

* Don't flush the buffer cache on a crash.   Some of you already do this
  for other reasons (it makes it more likely that you can get a crash
  dump).

  The kernel's flushing of the buffer cache is likely a cause of a
good chunk of the inconsistency reports by fsck, because unless
  someone worked on the buffer flushing code it likely bypasses
  softupdates.  I know when working on HAMMER I had to add a bioop
  explicitly to allow the kernel flush-buffers-on-crash code to query
  whether it was actually ok to flush a dirty buffer or not.  Until I
did that, DragonFly was flushing HAMMER buffers on crash which
  it had absolutely no business flushing.

* Implement active dependency flushing in softupdates.  Instead of
  having it just adjust the dependencies for later flushes, softupdates
  needs to actively initiate I/O for 

Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY

2008-09-29 Thread Matthew Dillon

:Completely agree.  ZFS is the way of the future for FreeBSD.  In my
:latest testing, the memory problems are now under control, there is just
:stability problems with random lockups after days of heavy load unless I
:turn off ZIL.  So its nearly there.
:
:If only ZFS also supported a network distributed mode.  Or can we
:convince you to port Hammer to FreeBSD? :-)
:
:- Andrew

Heh.  No, you guys would have to port it if you want it, though I would
be happy to support it once ported.  Issues are between minor and
moderate but would still require a knowledgeable filesystem person 
to do.  Biggest issues will be buffer cache and bioops differences,
and differences in the namespace VOPs.

--

But, IMHO, you guys should focus on ZFS since clearly a lot of work has
gone into its port, it works now in FBSD, and it just needs to be made
production-ready with a little more programming support from the
community.  It also has a different feature set than HAMMER.  P.S.
FreeBSD should issue a $$ grant or stipend to Pawel for that work,
he's really saving your asses.  UFS has clearly reached its end-of-life.

Speaking of ZFS, you guys probably went through the same boot issues
that we are going through with HAMMER.  I came up with a solution which
turned out to be quite non-invasive and very easy to implement.

* Make a small /boot UFS partition.  e.g. 256M ad0s1a.

* Put HAMMER (or ZFS in your case) on the rest of the disk (ad0s1d).

* Adjust the loader to search both / and /boot, so /boot can be its own
  partition or a sub-directory on root.

* Add a simple line to /boot/loader.conf to direct the kernel to the
  proper root, e.g.

  vfs.root.mountfrom="hammer:ad0s1d"

And poof, you're done.  Then when the system boots it boots into a
HAMMER (ZFS) root, and /boot is mounted as small UFS filesystem under
it.

Miscellanious other partitions would then be pseudo-fs's under the
single HAMMER (or ZFS) root, removing the need to worry about reserving
particular amounts of space, and providing the needed backup and
snapshot separation domains.

Well, you guys might have already solved it.  There isn't much to it.

I recall there was quite a discussion on trying to create redundant
boot setup on FreeBSD, such as boot-to-RAID setups, and having trouble
getting the BIOS to recognize it.  There's an alternative solution...
having a separate, small /boot means you can boot from a small solid
state storage device whos MTBF is going to be the same as the PC
hardware itself.  No real storage redundancy is needed and if your root
is somewhere else that gives you the option of putting more
sophisticated code in /boot (it being the full kernel) to properly
mount the root.  I have 0 (ZERO) trust in BIOS-RAID or card-supported
RAID-enabled (such as with 3Ware) boot support.  ZERO.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: bad NFS/UDP performance

2008-09-27 Thread Matthew Dillon
:how can I see the IP fragment reassembly statistics?
:
:thanks,
:   danny

netstat -s

Also look for unexpected dropped packets, dropped fragments, and
errors during the test and such, they are counted in the statistics
as well.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: bad NFS/UDP performance

2008-09-26 Thread Matthew Dillon

:  -vfs.nfs.realign_test: 22141777
:  +vfs.nfs.realign_test: 498351
: 
:  -vfs.nfsrv.realign_test: 5005908
:  +vfs.nfsrv.realign_test: 0
: 
:  +vfs.nfsrv.commit_miss: 0
:  +vfs.nfsrv.commit_blks: 0
: 
: changing them did nothing - or at least with respect to nfs throughput :-)
:
:I'm not sure what any of these do, as NFS is a bit out of my league.
::-)  I'll be following this thread though!
:
:-- 
:| Jeremy Chadwickjdc at parodius.com |

A non-zero nfs_realign_count is bad, it means NFS had to copy the
mbuf chain to fix the alignment.  nfs_realign_test is just the
number of times it checked.  So nfs_realign_test is irrelevant.
It's nfs_realign_count that matters.

Several things can cause NFS payloads to be improperly aligned.
Anything from older network drivers which can't start DMA on a 
2-byte boundary, resulting in the 14-byte encapsulation header 
causing improper alignment of the IP header & payload, to rpc
embedded in NFS TCP streams winding up being misaligned.

Modern network hardware either supports 2-byte-aligned DMA, allowing
the encapsulation to be 2-byte aligned so the payload winds up being
4-byte aligned, or supports DMA chaining, allowing the payload to be
placed in its own mbuf, or padding, etc.
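
As a rough sketch (not the actual NFS code; mtod() and struct mbuf
are the standard mbuf API), the alignment test amounts to walking
the mbuf chain and checking each payload pointer:

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /*
     * Return 0 if any mbuf in the chain carries a payload pointer
     * that is not 4-byte aligned; such a chain must be copied.
     */
    static int
    nfs_payload_aligned(struct mbuf *m)
    {
            for (; m != NULL; m = m->m_next) {
                    if (((uintptr_t)mtod(m, caddr_t) & 0x3) != 0)
                            return (0);
            }
            return (1);
    }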

--

One thing I would check is to be sure a couple of nfsiod's are running
on the client when doing your tests.  If none are running the RPCs wind
up being more synchronous and less pipelined.  Another thing I would
check is IP fragment reassembly statistics (for UDP) - there should be
none for TCP connections no matter what the NFS I/O size selected.

(It does seem more likely to be scheduler-related, though).

-Matt



Re: Max size of one swap slice

2008-08-06 Thread Matthew Dillon

:
:See
:
:http://www.freebsd.org/cgi/getmsg.cgi?fetch=540837+0+/usr/local/www/db/text/2008/freebsd-questions/20080706.freebsd-questions
:
:Kris

Hmm.  I see an issue that FreeBSD could correct to reduce wired
memory use by the swap system.

Your sys/blist.h has this:

typedef u_int32_t   u_daddr_t;  /* unsigned disk address */

and your sys/types.h has this:

typedef int64_t   daddr_t;  /* unsigned disk address */

sys/blist.h really assumes a 32 bit daddr_t.  It's amazing the code
even still works with daddr_t at 64 bits and u_daddr_t at 32 bits.

Changing that whole mess in sys/blist.h to a different typedef name,
say swblk_t (which is already defined to be 32 bits), and renaming
u_daddr_t to u_swblk_t, plus also changing the swblock structure
in vm/swap_pager.c to use a 32 bit array elements instead of 64 bit
array elements will cut the size of struct swblock almost in half.

There is no real need for swap block addressing > 32 bits.  32 bits
gives you swap in the terabyte range.

struct swblock {
struct swblock  *swb_hnext;
vm_object_t swb_object;
vm_pindex_t swb_index;
int swb_count;
daddr_t swb_pages[SWAP_META_PAGES];  /* <---- this array */
};

Any arithmetic accessing the elements would also have to be vetted
for any necessary promotions.
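
A minimal sketch of the suggested change (illustrative only, not
committed code; the u_swblk_t name just follows the existing swblk_t
convention):

    typedef u_int32_t       u_swblk_t;      /* unsigned swap block number */

    struct swblock {
            struct swblock  *swb_hnext;
            vm_object_t     swb_object;
            vm_pindex_t     swb_index;
            int             swb_count;
            u_swblk_t       swb_pages[SWAP_META_PAGES]; /* now 32 bits each */
    };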

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Max size of one swap slice

2008-08-05 Thread Matthew Dillon
: Recently we found that we can only allocate 32GB for one swap slice.
: Does there is any sysctl oid  or any kernel option to increase it? Why
: we have this restriction?
:
:this is a consequence of the data structure used to manage swap space.  See 
:sys/blist.h for details.  It *seems* that you *might* be able to increase the 
:coverage by decreasing BLIST_META_RADIX, but that's from a quick glance and 
:most certainly not a good idea.
:
:However, the blist is an abstract enough API so that you can likely replace it 
:with something that supports 64bit addresses (and thus 512*2^64 bytes of swap 
:space per device) ... but I don't see why you'd want to do something like 
:this.  Remember that you need memory to manage your swap space as well!
:
:-- 
:/\  Best regards,  | [EMAIL PROTECTED]
:\ /  Max Laier  | ICQ #67774661

The core structures can handle 2 billion swap pages == 2TB of swap,
but the blist code hits arithmetic overflows if a single blist has
more than (0x40000000 / BLIST_META_RADIX) = 1G/16 = 64M swap blocks,
or 256GB.

I think the VM/BIO system had additional overflow issues due to
conversions back and forth between PAGE_SIZE and DEV_BSIZE which
further restricted the limit to 32GB.  Those restrictions may be gone
now that FreeBSD is using 64 bit block numbers, so you may be able to
pop it up to 256GB with virtually no effort (but you need to test it
significantly!).
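
To make the overflow concrete, here is a contrived standalone
illustration (the values are examples only) of why those intermediate
PAGE_SIZE/DEV_BSIZE style conversions need a 64-bit promotion:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint32_t blk = 100u * 1024 * 1024;      /* 100M page-sized blocks */
            uint64_t bad  = blk * 4096u;            /* multiply wraps at 32 bits */
            uint64_t good = (uint64_t)blk * 4096u;  /* promote first: correct */

            printf("bad=%ju good=%ju\n", (uintmax_t)bad, (uintmax_t)good);
            return (0);
    }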

With some work on the blist code only (not its structures) the arithmetic
overflow issues could also be resolved, increasing the swap capability
to 2TB.

I do not recommend changing any of the core blist structure, particularly
not BLIST_META_RADIX.  Just don't try :-).  You do NOT want to bump
the swap block number fields to 64 bits.

Also note that significant memory is used to manage that much swap.  It's
a factor of 1:16384 or so for the blist structures and probably about
the same amount for the vm_object tracking structures.  32G of swap needs
around 2-4MB of wired ram.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: taskqueue timeout

2008-07-15 Thread Matthew Dillon

:Hi everyone,
:
:I'm wondering if the problems described in the following link have been 
:resolved:
:
:http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html
:
:I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them 
:are experiencing the behavior.
:
:The problem only happens with extreme disk activity. The box becomes 
:unresponsive (can not SSH etc). Keyboard input is displayed on the 
:console, but the commands are not accepted.
:
:Is there anything I can do to either figure this out, or work around it?
:
:Steve

If you are getting DMA timeouts, go to this URL:

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Then I would suggest going into /usr/src/sys/dev/ata (I think, on
FreeBSD), locate all instances where request->timeout is set to 5,
and change them all to 10.

cd /usr/src/sys/dev/ata
fgrep 'request->timeout' *.c
... change all assignments of 5 to 10 ...

Try that first.  If it helps then it is a known issue.  Basically
a combination of the on-disk write cache and possible ECC corrections,
remappings, or excessive remapped sectors can cause the drive to take
much longer than normal to complete a request.  The default 5-second
timeout is insufficient.

If it does help, post confirmation to prod the FBsd developers to
change the timeouts.

--

If you are NOT getting DMA timeouts then the ZFS lockups may be due
to buffer/memory deadlocks.  ZFS has knobs for adjusting its memory
footprint size.  Lowering the footprint ought to solve (most of) those
issues.  It's actually somewhat of a hard issue to solve.  Filesystems
like UFS aren't complex enough to require the sort of dynamic memory
allocations deep in the filesystem that ZFS and HAMMER need to do.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Multi-machine mirroring choices

2008-07-15 Thread Matthew Dillon

:Oliver Fromme wrote:
:
: Yet another way would be to use DragonFly's Hammer file
: system which is part of DragonFly BSD 2.0 which will be
: released in a few days.  It supports remote mirroring,
: i.e. mirror source and mirror target can run on different
: machines.  Of course it is still very new and experimental
: (however, ZFS is marked experimental, too), so you probably
: don't want to use it on critical production machines.
:
:Let's not get carried away here :)
:
:Kris

Heh.  I think its safe to say that a *NATIVE* uninterrupted and fully
cache coherent fail-over feature is not something any of us in BSDland
have yet.  It's a damn difficult problem that is frankly best solved
above the filesystem layer, but with filesystem support for bulk mirroring
operations.

HAMMER's native mirroring was the last major feature to go into
it before the upcoming release, so it will definitely be more
experimental than the rest of HAMMER.  This is mainly because it
implements a full blown queue-less incremental snapshot and mirroring
algorithm, single-master-to-multi-slave.  It does it at a very low level,
by optimally scanning HAMMER's B-Tree.  In other words, the kitchen
sink.

The B-Tree propagates the highest transaction id up to the root to
support incremental mirroring and that's the bit that is highly
experimental and not well tested yet.  It's fairly complex because
even destroyed B-Tree records and collapses must propagate a
transaction id up the tree (so the mirroring code knows what it needs
to send to the other end to do comparative deletions on the target).

(transaction ids are bundled together in larger flushes so the actual
B-Tree overhead is minimal).

The rest of HAMMER is shaping up very well for the release.  It's
phenomenal when it comes to storing backups.  Post-release I'll be
moving more of our production systems to HAMMER.  The only sticky
issue we have is filesystem-full handling, but it is more a matter
of fine-tuning than anything else.

--

Someone mentioned atime and mtime.  For something like ZFS or HAMMER,
these fields represent a real problem (atime more than mtime).  I'm
kinda interested in knowing, does ZFS do block replacement for
atime updates?

For HAMMER I don't roll new B-Tree records for atime or mtime updates.
I update the fields in-place in the current version of the inode and
all snapshot accesses will lock them (in getattr) to ctime in order to
guarantee a consistent result.  That way (tar | md5) can be used to
validate snapshot integrity.

At the moment, in this first release, the mirroring code does not
propagate atime or mtime.  I plan to do it, though.  Even though
I don't roll new B-Tree records for atime/mtime updates I can still
propagate a new transaction id up the B-Tree to make the changes
visible to the mirroring code.  I'll definitely be doing that for mtime
and will have the option to do it for atime as well.  But atime still
represents a big expense in actual mirroring bandwidth.  If someone
reads a million files on the master then a million inode records (sans
file contents) would end up in the mirroring stream just for the atime
update.  Ick.

-Matt



Re: taskqueue timeout

2008-07-15 Thread Matthew Dillon
:Went from 10-15, and it took quite a bit longer into the backup before 
:the problem cropped back up.

Try 30 or longer.  See if you can make the problem go away entirely,
then fall back to 5 and see if the problem resumes at its earlier
pace.

--

It could be temperature related.  The drives are being exercised
a lot, they could very well be overheating.  To find out add more
airflow (a big house fan would do the trick).

--

It could be that errors are accumulating on the drives, but it seems
unlikely that four drives would exhibit the same problem.

--

Also make sure the power supply can handle four drives.  Most power
supplies that come with consumer boxes can't handle that under full load
if you also have a mid or high-end graphics card installed.  Power supplies
that come with OEM slap-together enclosures are not usually much better.

Specifically, look at the +5V and +12V amperage maximums on the power
supply, then check the disk labels to see what they draw, then
multiply by 2.  e.g. if your power supply can do [EMAIL PROTECTED] and you have
four drives each taking [EMAIL PROTECTED] (and typically ~half that at 5V), that's
4x2x2 = [EMAIL PROTECTED] and you would probably be ok.

To test, remove two of the four drives, reformat the ZFS to use just 2,
and see if the problem reoccurs with just two drives.

-Matt



Re: taskqueue timeout

2008-07-15 Thread Matthew Dillon

:...
: and see if the problem reoccurs with just two drives.
:
:... I knew that was going to come up... my response is I worked so hard 
:to get this system with ZFS all configured *exactly* how I wanted it.
:
:To test, I'm going to flip to 30 as per Matthew's recommendation, and see 
:how far that takes me. At this time, I'm only testing by backing up one 
:machine on the network. If it fails, I'll clock the time, and then 
:'reformat' with two drives.
:
:Is there a technical reason this may work better with only two drives?
:
:Is there anyone interested to the point where remote login would be helpful?
:
:Steve

This issue is vexing a lot of people.

Setting the timeout to 30 will not affect performance, but it will
cause a 30 second delay in recovery when (if) the problem occurs.
i.e. when the disk stalls it will just sit there doing nothing for
30 seconds, then it will print the timeout message and try to recover.

It occurs to me that it might be beneficial to actually measure the
disk's response time to each request, and then graph it over a period
of time.  Maybe seeing the issue visually will give some clue as to the
actual cause.
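
A crude userland sketch of that idea (the device path, read size and
count are examples); it prints one latency per request so the output
can be fed straight to a plotting tool:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(void)
    {
            char *buf = malloc(65536);
            int fd = open("/dev/ad4", O_RDONLY);    /* example device */
            struct timespec t0, t1;
            int i;

            if (fd < 0 || buf == NULL)
                    return (1);
            for (i = 0; i < 1000; i++) {
                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    if (read(fd, buf, 65536) <= 0)
                            break;
                    clock_gettime(CLOCK_MONOTONIC, &t1);
                    /* one "request latency-in-seconds" pair per line */
                    printf("%d %.6f\n", i, (t1.tv_sec - t0.tv_sec) +
                        (t1.tv_nsec - t0.tv_nsec) / 1e9);
            }
            close(fd);
            free(buf);
            return (0);
    }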

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: dvd dma problems

2008-07-14 Thread Matthew Dillon

: quite fine from 5.3 to somewhere in the 6.x branch. Nowadays I have to send
: them to PIO4 to play DVDs, because they'll just throw DMA not aligned errors
: around in UDMA33 or WDMA2 mode.
: 
: Should someone be interested in this I'm willing to supply all necessary
: information, such as the exact drives, firmware versions, kernel traces...
: whatever comes to your mind. I'm also willing to test patches.
:
:Is the problem you're seeing identical to this?
:
:http://lists.freebsd.org/pipermail/freebsd-hackers/2008-July/025297.html
:
:-- 
:| Jeremy Chadwickjdc at parodius.com |

One of our guys (in DragonFly-land) tracked this down to two issues;
fixing either one will fix the problem.  I'd include a patch but he
has not finished it yet.  Still, anyone with moderate kernel
programming skills can probably fix it in an hour or less.

physio() - uses vmapbuf().  vmapbuf() does NOT realign the user address,
it simply maps it into the buffer and adjusts b_data.  So if the
user supplies a badly aligned buffer, physio() will happily pass that
bad alignment to the driver.

physio() could be modified to allocate kernel memory to back the pbuf
and copy instead of calling vmapbuf(), for those cases where the user
supplied buffer is not well aligned (e.g. not at least 512-byte aligned).
Pbufs already reserve KVA, so all one would need to do is allocate
pages to back the KVA space.  I think a couple of other subsystems in
the kernel do this with pbufs so there is plenty of example material.

--

The ATA driver has an explicit alignment check and also uses
BUS_DMA_NOWAIT in its call to bus_dmamap_load() in ata_dmaload().

The ATA driver could be adjusted to remove the alignment check,
remove the BUS_DMA_NOWAIT flag, and also not free the bounce buffer
when DMA ends (so you don't get allocator deadlocks).  You might have
other issues related to lock ordering, and this solution would eat
a considerable amount of memory (upwards of a megabyte, more if you have
more ATA channels), but that's the gist of it.

It should be noted that only physio() can supply unaligned BIOs to the
driver layer.  All other BIO sources (that I know of) will always be
at least 512-byte aligned.

--

My recommendation is to fix physio().  User programs that do not supply
aligned buffers clearly don't care about performance, so the kernel
can just back the pbuf with memory and copyin/out the user data.
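
Conversely, a userland program that does care can sidestep the issue
by handing the kernel an aligned buffer in the first place.  A minimal
sketch using the standard posix_memalign() call:

    #include <stdlib.h>

    int
    main(void)
    {
            void *buf;

            /* Ask for a 512-byte-aligned 64K buffer for raw device I/O. */
            if (posix_memalign(&buf, 512, 65536) != 0)
                    return (1);
            /* ... use buf with read()/write() on the raw device ... */
            free(buf);
            return (0);
    }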

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Performance of madvise / msync

2008-06-27 Thread Matthew Dillon
:With madvise() and without msync(), there are high numbers of
:faults, which matches the number of disk io operations.  It
:goes through cycles, every once in a while stalling while about
:60MB of data is dumped to disk at 20MB/s or so (buffers flushing?)
:At the beginning of each cycle it's fast, with 140 faults/s or so,
:and slows as the number of faults climbs to 180/s or so before
:stalling and flusing again.  It never gets _really_ slow though.

Yah, without the msync() the dirty pages build up in the kernel's
VM page cache.  A flush should happen automatically every 30-60
seconds, or sooner if the buffer cache builds up too many dirty pages.

The activity you are seeing sounds like the 30-60 second filesystem
sync the kernel does periodically.

Either NetBSD or OpenBSD, I forget which, implemented a partial sync
feature to prevent long stalls when the filesystem syncer hits a file
with a lot of dirty pages.  FreeBSD could borrow that optimization if
they want to reduce stalls from the filesytem sync.  I ported it to DFly
a while back myself.

:With msync() and without madvise(), things are very slow, and
:there are no faults, just writes.
:...
:  The size_t argument to msync() (0x453b7618) is highly questionable.
:  It could be ktrace reporting the wrong value, but maybe not.
:
:That's the size of rg2.rrd.  It's 1161524760 bytes long.
:...
:Looks like the source of my problem is very slow msync() on the
:file when the file is over a certain size.  It's still fastest
:without either madvise or msync.
:
:Thanks for your time,
:
:Marcus

The msync() is clearly the problem.  There are numerous optimizations
in the kernel but msync() is frankly a rather nasty critter even with
the optimizations in place.  Nobody using msync() in real life ever tries
to run it over the entirety of such a large mapping... usually it is
just run on explicit sub-ranges that the program wishes to sync.

One reason why msync() is so nasty is that the kernel must physically
check the page table(s) to determine whether a page has been marked dirty
by the MMU, so it can't just iterate the pages it knows are dirty in
the VM object.  It's nasty whether it scans the VM object and iterates
the page tables, or scans the page tables and looks up the related VM
pages.   The only way to optimize this is to force write-faults by
mapping clean pages read-only, in order to track whether a page is
actually dirty in real time instead of lazily.  Then msync() would
only have to do a ranged-scan of the VM object's dirty-page list
and would not have to actually check the page tables for clean pages.
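
For example, a minimal sketch of the sub-range idiom (the file name
and size are borrowed from this thread purely as examples): only the
pages actually dirtied are handed to msync(), never the whole mapping:

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            size_t len = 1161524760;        /* example: size of rg2.rrd */
            int fd = open("rg2.rrd", O_RDWR);
            char *p;

            if (fd < 0)
                    return (1);
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return (1);

            memset(p + 4096, 0, 4096);      /* dirty a single page */

            /* Sync just that page-aligned sub-range, not all ~1.1GB. */
            msync(p + 4096, 4096, MS_ASYNC);

            munmap(p, len);
            close(fd);
            return (0);
    }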

A secondary effect of the msync() is that it is initiating asynchronous
I/O for what sounds like hundreds of VM pages, or even more.  All those
pages are locked and busied from the point they are queued to the point
the I/O finishes, which for some of the pages can be a very, very long
time (into the multiples of seconds).  Pages locked that long will
interfere with madvise() calls made after the msync(), and probably
even interfere with the following msync().

It used to be that msync() only synced VM pages to the underlying
file, making them consistent with read()'s and write()'s against
the underlying file.  Since FreeBSD uses a unified VM page cache
this is always true.  However, the Open Group specification now
requires that the dirty pages actually be written out to the underlying
media... i.e. issue real I/O.  So msync() can't be a NOP if you go by
the OpenGroup specification.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Performance of madvise / msync

2008-06-26 Thread Matthew Dillon
:   65074 python   0.06 CALL  madvise(0x287c5000,0x70,_MADV_WILLNEED)
:   65074 python   0.027455 RET   madvise 0
:   65074 python   0.58 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
:   65074 python   0.016904 RET   madvise 0
:   65074 python   0.000179 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
:   65074 python   0.008629 RET   madvise 0
:   65074 python   0.40 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
:   65074 python   0.004173 RET   madvise 0
:...
:   65074 python   0.006084 CALL  msync(0x287c5000,0x453b7618,MS_ASYNC)
:   65074 python   0.106284 RET   msync 0
:...
:As you can see, it's quite a bit faster.
:
:I know that msync is necessary under Linux but obsolete under FreeBSD, but
:it's still funny that it takes a tenth of a second to return even with
:MS_ASYNC specified.
:
:Also, why is it that the madvise() calls take so much longer when the
:program does a couple of its own madvise() calls?  Was madvise() never
:intended to be run so frequently and is therefore a little slower than
:it could be?
:
:Here's the diff between the code for the first kdump above and the
:second one.

 Those times are way way too large, even with other running threads
 in the system.  madvise() should not take that long unless it is
 being forced to wait on a busied page, and neither should msync().
 madvise() doesn't even do any I/O (or shouldn't anyhow).

 Try removing just the msync() but keep the madvise() calls and see
 if the madvise() calls continue to take horrendous amounts of time.
 Then try the vice-versa.

 It kinda feels like a prior msync() is initiating physical I/O on
 pages and a later mmap/madvise or page fault is being forced to
 wait on the prior pages for the I/O to finish.

 The size_t argument to msync() (0x453b7618) is highly questionable.
 It could be ktrace reporting the wrong value, but maybe not.
 On any sort of random writing test, particularly if multiple threads
 are involved, specifying a size that large could result in very large
 latencies.

-Matt



Re: Sysctl knob(s) to set TCP 'nagle' time-out?

2008-06-23 Thread Matthew Dillon

:Hi,
:
:I'm wondering if anything exists to set this.. When you create an INET  
:socket
:without the 'TCP_NODELAY' flag the network layer does 'naggling' on your
:transmitted data. Sometimes with hosts that use Delayed_ACK  
:(net.inet.tcp.
:delayed_ack) it creates a dead-lock where the host will not ACK until  
:it gets
:another packet and the client will not send another packet until it  
:gets an ACK..
:
:The dead-lock gets broken by a time-out, which I think is around 200ms?
:
:But I would like to change that time-out if possible to something  
:lower, yet
:I can't really see any sysctl knobs that have a name that suggests  
:they do
:that..
:
:So does anyone know IF this can be tuned and if so by what?
:
:Cheers,
:Jerahmy.
:
:(And yes you could solve it by setting the TCP_NODELAY flag on the  
:socket,
:but not everything has programmed in options to set it and you don't  
:always
:have access to the source, besides setting a sysctl value would be much
:simpler than recompiling stuff)

There is a sysctl which adjusts the delayed-ack timing, its
called net.inet.tcp.delacktime.  The default is 1/10 of a second
(100 == 100 ms = 1/10 of a second).

BUT, it shouldn't be possible for Nagle to deadlock against delayed acks
unless the TCP implementation is broken somehow.  A delayed ack is
simply that... the ack is delayed 100 ms in order to improve its
chances of being piggy-backed on return data.  The ack is not blocked
completely, just delayed, and certain events (such as the receiving
end turning around and sending data back, which is typical for an
interactive connection)... certain events will cause the delayed ack
to be aborted and for the ack to be immediately sent with the return data.

Can it break down and cause excessive lag?  Yes, it can.  Interactive
games almost universally have to disable Nagle because the lag is
actually due to the data relay from client 1 -> server then relaying
the interactive event to client 2.  Without an immediate interactive
response to client 1 the ack gets delayed and the next event from
client 1 hits Nagle and stops dead in the water until the first event
reaches client 2 and client 2 reacts to it (then client 2 -> server ->
(abort delayed ack and send) -> client 1 (client 1's Nagle now allows
the second event to be transmitted).  That isn't a deadlock, just 
really poor interactive performance in that particular situation.

Delayed acks also have a safety valve.  The spec says that an ack
cannot be delayed more than two packets.  In a batch link when the
second (unacked) packet is received, the delayed ack is aborted and
an ack is immediately returned to the sender.  This is to prevent
congestion control (which is based on acks) from getting completely
out of whack and also to prevent the TCP window from getting exhausted.

In any case, the usual solution is to disable Nagle rather than mess
with delayed acks.  What we need is a new Nagle that understands the
new reality for interactive connections... something that doesn't break
performance in the 'server in the middle' data relaying case.
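
For reference, disabling Nagle on a single socket is the standard
setsockopt() call (a sketch; error handling left to the caller):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int
    set_nodelay(int fd)
    {
            int one = 1;

            /* Disable Nagle on this socket only. */
            return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                &one, sizeof(one)));
    }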

-Matt



Re: Sysctl knob(s) to set TCP 'nagle' time-out?

2008-06-23 Thread Matthew Dillon

:One possibility I see is a statistic about DelACKs per TCP connection,
:counting those that were rightfully delayed (with hindsight). I.e.,
:if an ACK is delayed, but there was no chance to piggy-back it or to
:combine it with another ACK, it could have been sent without delay.
:Only those delayed ACKs that reduce load are good, all others cause
:additional state to be maintained and may increase latencies for no
:good reason.
:
:...
:consideration. And to me, automatic setting of TCP_NODELAY seems
:more useful than automatic clearing (after delayed ACKs had been
:found to be of no use for a window of say 8 or 16 ACKs).
:
:The implementation would be quite simple: Whenever a delayed ACK
:is sent, check whether it is sent on its own (bad) or whether it
:could be piggy-backed (good). If, say, 7 of 8 delayed ACKs had to
:be sent as ACK-only packets, anyway, set TCP_NODELAY and do not
:bother to keep on deciding whether delayed ACKs had become useful
:in a different phase of the communication. If you want to be able
:to automatically disable TCP_NODELAY, then just set a time-stamp
:...
:Regards, STefan

That's an interesting approach.  I think it would catch some
of the cases, but not enough of them.  If the round-trip in
the server-relaying case is less then the delayed-ack, the acks
will still wind up piggy-backed on return traffic but the latency
will also still remain horrible.

It should be noted that Nagle can cause high latencies even when
delayed acks are turned off.  Nagle's delay is not timed... in its
simplest description it prevents packets from being transmitted
for new data coming from userland if the data already in the
sockbuf (and presumably already transmitted) has not yet been
acknowledged.

For interactive traffic this means that Nagle is putting the screws
on the packet stream even if the acks aren't delayed, simply from the
ack latency.  With delayed acks turned off the latency is lower, but
not 0, so interactive traffic is still being held up by Nagle.  The
effect is noticeable even on a LAN.  Jerahmy brought up Samba... that
is an excellent example.  NFS-over-TCP would be another good example.

Any protocol which multiplexes multiple commands from different
sources over the same connection gets really messed up (slowed down)
by Nagle.

On the flip side, Nagle can't just be turned off by default because
it would cause streaming connections from user programs which do tiny
writes to generate a lot of unnecessarily tiny packets.  This can become
apparent when using SSH over a slow link.  Numerous programs run from
a shell generate fairly inefficient packets which could have easily
been batched when operating over SSH.  The result can be sludgy
performance for output which ought be batched up by TCP but isn't because
SSH turns off Nagle unconditionally.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Sockets stuck in FIN_WAIT_1

2008-05-30 Thread Matthew Dillon

:Yes, IPFW is running on the box.  Why not?
:
:-- 
:Robert Blayzor, BOFH
:INOC, LLC
:[EMAIL PROTECTED]
:http://www.inoc.net/~rblayzor/

There's nothing wrong with running IPFW on the same box :-)

But, I think that rule change is masking the problem rather than solving
it.  The keep-state is limited.  The reason the number of dead connections
isn't going up is probably because IPFW is either hitting its keep-state
limit and dropping connections, or the connection becomes idle long 
enough for IPFW to recycle the keep-state for it, also causing it to
drop.

Once the keep-state is lost that deny established rule will cause the
connection to fail.

I would be very careful with any type of ruleset (IPFW or PF) which
relies on keep-state.  You can wind up causing legitimate connections
to drop if it isn't carefully tuned.

It might be a reasonable bandaid, though.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Sockets stuck in FIN_WAIT_1

2008-05-29 Thread Matthew Dillon
I guess nobody mentioned the obvious thing to check:  Make sure
TCP keepalive is turned on.

sysctl net.inet.tcp.always_keepalive=1

If you don't do this then dead TCP connections can build up, particularly
on busy servers, due to the other end simply disappearing.

Without this option the TCP protocol can get stuck, because it does not
usually send packets to the other end of an idle connection unless 
(1) its window has closed completely or (2) it has unacknowledged data
or state pending.  The keepalive forces a probe to occur every so often
on an idle connection (like once every 30min-2hrs, I forget what the
default is), to check that the connection still exists.

It is possible to get stuck during normal data operation and while in
a half-closed state.  The 2MSL timeout does not activate until you
go into a fully closed state (FIN2/TIME_WAIT).

Pretty much if you are running any sort of service on the internet,
and even if you aren't, you need to make sure keepalive is turned on
for the long term health of your system.
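
The sysctl turns keepalive on globally; per socket the same thing is
done with SO_KEEPALIVE, e.g. (a minimal sketch):

    #include <sys/socket.h>

    int
    set_keepalive(int fd)
    {
            int on = 1;

            /* Enable keepalive probes on this connection. */
            return (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
                &on, sizeof(on)));
    }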

-Matt



Re: Sockets stuck in FIN_WAIT_1

2008-05-29 Thread Matthew Dillon

:On May 29, 2008, at 3:12 PM, Matthew Dillon wrote:
:I guess nobody mentioned the obvious thing to check:  Make sure
:TCP keepalive is turned on.
:
:sysctl net.inet.tcp.always_keepalive=1
:
:
:Thanks Matt.
:
:I also thought that a keepalives were not running and sessions just  
:stuck around forever, however I do have:
:
:
:net.inet.tcp.keepidle=90
:net.inet.tcp.keepintvl=3
:net.inet.tcp.msl=5000
:net.inet.tcp.always_keepalive=1  (default)
:
:
:I believe keep idle was defaulted to 2hrs, I changed it to 15 minutes  
:with a 30 second tick... I still found FIN_WAIT_1 sessions stuck for  
:several hours, if not infinite.
:
:Nonetheless, I have a new server up running 7.0-p1, I'll be pumping  
:a lot of traffic to that box soon and I'll see how that makes out.
:
:-- 
:Robert Blayzor, BOFH
:INOC, LLC
:[EMAIL PROTECTED]
:http://www.inoc.net/~rblayzor/

If it is still giving you trouble I recommend using tcpdump to observe
the IP/port pair of one of the stuck connections over the keepalive
period and see if the keepalives are still being sent and, if they are,
what kind of response you get from the other end.

It is quite possible that the other ends of the connection are still
live and that the issue could very well be a timeout setting in the
server config file instead of something in the TCP stack.

This is what you should see when a keepalive occurs over an idle
connection:

* A TCP packet w/ 0 data sent to the remote
* A response from the remote:  Either a pure ACK, or a TCP RESET

If no response occurs from the remote the keepalive code will then
retry a couple of times over keepintvl (every 30 seconds in your case),
and if it still gets no response after I think 3 retries (30+30+30 = 90
seconds later) it should terminate the connection state.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Sockets stuck in FIN_WAIT_1

2008-05-29 Thread Matthew Dillon

:I think we're onto something here, but for some reason it doesn't make  
:any sense.  I have keepalives turned OFF in Apache:
:
:When I tcpdump this, I see something sending ack's back and forth  
:every 60 seconds, but what?  Apache?  I'm not sure why.   I don't see  
:any timeouts in Apache for ~60 seconds.  As you can see, sometimes we  
:send an ack, but never see a reply.  I'm gathering the OS level  
:keepalives don't come into play because this session is not considered  
:idle?
:
:
:20:13:07.640426 IP 1.1.1.1.80 > 2.2.2.2.33379: .
:4208136508:4208136509(1) ack 1471446041 win 520 nop,nop,timestamp
:3019088951 5004131
:20:13:07.736505 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
:nop,nop,timestamp 5022148 3019088951
:20:14:07.702647 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:nop,nop,timestamp 3019148951 5022148
:20:15:07.764920 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:nop,nop,timestamp 3019208951 5022148
:20:15:07.860988 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
:nop,nop,timestamp 5058183 3019208951
:20:16:07.827262 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:...

Yah, the connection is valid so keepalives do not come into play.
What is happening is that 1.1.1.1 wants to send something to 2.2.2.2,
but 2.2.2.2 is telling 1.1.1.1 that it has no buffer space (win 0).

This forces the TCP stack on 1.1.1.1 (the kernel, not the apache server)
to 'probe' the connection, which it appears to be doing once a minute.
It is probing the connection waiting for 2.2.2.2 to tell it that buffer
space is available (win != 0).

The connection remains valid because 2.2.2.2 continues to respond to
the probes.

Now, the connection is also in a half-closed state, which means that
one direction is closed.  I can't tell which direction that is but my
guess is that 1.1.1.1 (the apache server) closed the 1.1.1.1-2.2.2.2
direction and the 2.2.2.2 box has a broken TCP implementation and can't
deal with it.

:I'm finding several of these sessions doing the same exact thing
:
:-- 
:Robert Blayzor, BOFH
:INOC, LLC

I can suggest two things.  First, the TCP connection is good but you
still may be able to tell Apache, in the apache configuration file, to
timeout after a certain period of time and clear the connection.

Secondly, it may be beneficial to identify exactly what the client and
server were talking about which caused the client to hang with a live
tcp connection.  The only way to do that is to tcpdump EVERYTHING going
on related to the apache server, save it to a big-ass disk partition
(like 500G), and then when you see a stuck connection go back through
the tcpdump log file and locate it, grep it out, and review what exactly
it was talking about.  You'd have to tcpdump with options to tell it to
dump the TCP data payloads.
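
Something along these lines would do it (the interface, filter and
file name are examples):

    # -s 0 captures full payloads rather than just the headers
    tcpdump -i em0 -s 0 -w /bigdisk/httpd.pcap port 80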

It seems likely that the client is running an applet or javascript that
receives a stream over the connection, and that applet or javascript
program has locked up, causing the data sent from the server to build up
and for the client's buffer space to run out, and start advertising the
0 window.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Sockets stuck in FIN_WAIT_1

2008-05-29 Thread Matthew Dillon

:This is exactly what we're seeing, it's VERY strange.  I did kill off  
:Apache, and all the FIN_WAIT_1's stuck around, so the kernel is in  
:fact sending these probe packets, every 60 seconds, which the client  
:responds to... (most of the time).

Ach.  Now that I think about it, it is still possible for it to
happen that way.  Apache closed the connection while there was
still data in the socket buffer to the client.  The client then
refused to read it, but otherwise left the connection alive.

It's got to a be a bug on the client(s) in question.  I can't think
of anything else.   You may have to resort to injecting a TCP RST
packet (e.g. via a TUN device) to clear the connections.

-Matt


Re: udf

2008-05-19 Thread Matthew Dillon
: BTW, Remko has kindly notified me that Reinoud Zandijk has completed his
: long work on UDF write support in NetBSD. I think that porting his work
: is our best chance to get write support in FreeBSD too.
: 
:
:I think you'll find that implementing VOPs and filling in UDF data
:structures will be easy, while interacting with the VM will be many
:orders of magnitude harder.  Still it should be a fun challenge for
:someone to do.
:
:Scott

One avenue that can be pursued would be to finish the UIO_NOCOPY support
in vm/vnode_pager.c.  You have UIO_NOCOPY support for the putpages
code but not the getpages code.

If that were done the VFS can simply use VMIO-backed buffer cache buffers
(they have to be VMIO-backed for UIO_NOCOPY to work properly)...  and
not have to deal with getpages or putpages at all.  The vnode pager
would convert them to a UIO_NOCOPY VOP_READ or VOP_WRITE as appropriate.
The entire VM coding burden winds up being in the kernel proper and
not in the VFS at all.

IMHO implementing per-VFS getpages/putpages is an exercise in frustration,
to be avoided at all costs.  Plus once you have a generic getpages/putpages
layer in vm/vnode_pager.c the VFS code no longer has to mess with VM pages
anywhere and winds up being far more portable.  I did the necessary work
in DragonFly in order to avoid having to mess with VM pages in HAMMER.

Primary work:

* It is a good idea to require that all vnode-based buffer cache buffers
  be B_VMIO backed (aka have a VM object).  It ensures a clean interface
  and avoids confusion, and also cleans up numerous special cases that
  are simply not needed in this day and age.

* Add support for UIO_NOCOPY in generic getpages.  Get rid of all the
  special cases for small-block filesystems in getpages.  Make it
  completely generic and simply issue the UIO_NOCOPY VOP_READ/VOP_WRITE.

* Make minor adjustments to existing VFSs (but nothing prevents them from
  still rolling their own getpages/putpages so no major changes are 
  needed).

And then enjoy the greatly simplified VFS interactions that result.

I would also recommend removing the VOP_BMAP() from the generic
getpages/putpages code and simply letting the VFS's VOP_READ/VOP_WRITE
deal with it.  The BMAP calls were being made from getpages/putpages to
check for discontiguous blocks, to avoid unnecessary disk seeks.  Those
checks are virtually worthless on today's modern hardware particularly
since filesystems already localize most data accesses.  In other words,
if your filesystem is fragmented you are going to be doing the seeks
anyway, probably.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: fsck_ufs: cannot alloc 94208 bytes for inoinfo

2008-02-27 Thread Matthew Dillon
 fsck's memory usage is directly related to the number of inodes and
 the number of directories in the filesystem.  Directories are
 particularly memory intensive.

 I've found on my backup system that a UFS1 filesystem with 40 million
 inodes is about the limit that can be fsck'd (at least with a 32 bit
 architecture).  My cron jobs keep my backup partition below that point.
 Even in a 64 bit environment you will be limited by swap and the sheer
 time it takes for fsck to run.  It takes well over 8 hours for my
 backup system to fsck.

 You can also reduce fsck time by reducing the number of cylinder
 groups on the disk.  I usually max them out (-c 999; newfs then
 clamps it to the maximum, usually in the 50-80 range).  This will
 improve performance but not reduce the memory required.
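
 An example invocation (the device name is an example; check newfs(8)
 on your release for the exact -c semantics):

     # An oversized -c is clamped to the per-filesystem maximum,
     # minimizing the number of cylinder groups.
     newfs -c 999 /dev/ad0s1e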

-Matt



Re: How to take down a system to the point of requiring a newfs with one line of C (userland)

2008-02-18 Thread Matthew Dillon
Jim's original report seemed to indicate that the filesystem paniced
on mount even after repeated fsck's.

That implies that Jim has a filesystem image that panics on mount. 
Maybe Jim can make that image available and a few people can see if
downloading and mounting it reproduces the problem.  It would narrow
things down anyhow.

Also, I didn't see a system backtrace anywhere.  If it paniced, where
did it panic?

The first thing that came to my mind was the dirhash code, but simply
mounting a filesystem doesn't scan the mount point directory at all,
except possibly for '.' or '..'... I don't think it even does that.  All
it does is resolve the root inode of the filesystem.  The code path
for mounting a UFS or UFS2 filesystem is very short.

-Matt


Re: Quation about HZ kernel option

2007-10-04 Thread Matthew Dillon
The basic answer is that HZ is almost, but not quite irrelevant.

If a process blocks, another will immediately be scheduled.  More
importantly, if an interrupt driven event (keyboard, tty, network,
disk, etc) wakes a process up the scheduler has the ability to force
an IMMEDIATE reschedule.   Nearly ALL process related events schedule
the process from this sort of reschedule.  Generally speaking only
cpu-bound processes will be hitting the scheduler quantum on a regular
basis.

For network protocols HZ is the basis for the timeout subsystem which
is only triggered when things actually time-out, which is fairly rare
in a normally running system.

Queue timers, select timeouts, and nanosleep are restricted by HZ in
granularity, but in nearly all cases those calls are used with
very large timeouts not really subject to the granularity of HZ.

I think a higher HZ can be somewhat beneficial if you are running a 
lot of processes which fall through the scheduler's cracks (both cpu
and disk bound, usually), or if the scheduler is badly written, but
otherwise a lower value will not have much of an effect.  I would not
go under 100, though.  I personally believe that a default of 1000 is
ridiculously high, especially on a SMP system.

-Matt



Re: Quation about HZ kernel option

2007-10-04 Thread Matthew Dillon

:Nuts! Everybody has his own opinion on this matter.
:Any idea how to actually build a synthetic but close-to-real
:benchmark for this?

It is literally impossible to write a benchmark to test this, because
the effects you are measuring are primarily scheduling effects related
to the scheduling algorithm and not so much the time quantum.

One can demonstrate that ultra low values of HZ are bad, and ultra high
values of HZ are also bad, but everything in the middle is subject to
so much 'noise', to the type of test, the scheduler algorithm, and so
on and so forth that it is just impossible.

This is probably why there is so much argument over the issue.

:For example:
:Usual web server does:
:1) forks
:2) reads a bunch of small files from disk for some time
:3) forks some cgi scripts
:4) dies
:
:If i write a test in C doing somthing like this and run
:very many of then is parallel for, say, 1 hour and then
:count how many interation have been done with HZ=100 and
:with HZ=1000 will it be a good test for this?
:
:--
:Regards
:Artem

Well, the vast majority of web pages are served in a microsecond
timeframe and clearly not subject to scheduler quantum because the
web server almost immediately blocks.  Literally 100 uS or less and
the web server's work is done.

You can ktrace a web server to see this in action.  Serving pages is
usually either very fast or the process winds up blocking on I/O (again
not subject to the scheduler quantum).
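
e.g. (the pid is an example):

    ktrace -p 1234      # attach to a running httpd
    kdump -T | less     # -T timestamps each record, showing how quickly
                        # a request completes before the process blocks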

CGIs and applets are another story because they tend to be more 
cpu-intensive, but I would argue that the scheduler algorithm will have
a much larger effect on performance and interactivity then the time
quantum.  You only have so much cpu to play with -- a faster HZ will
not give you more, so if your system is cpu bound it all comes down
to the scheduler selecting which processes it feels are the most
important to run at any given moment.

One might think that quickly switching between processes is a good idea
but there are plenty of workloads where it can have catastrophic results,
such as when a X client is shoving a lot of data to the X server.  In
that case fast switching is bad because efficient client/server
interactions depend very heavily on the client being able to build up
a large buffer of operations for the server to execute in bulk.  X
becomes wildly inefficient with fast switching... It can wind up going
2x, 4x, even 8x slower.

Generally speaking, any pipelined workload suffers with fast switching
whereas non-pipelined workloads tend to benefit.  Operations which can
complete in a short period of time anyway (say 10ms) suffer if they are
switched out, operations which take longer do not.  One of the biggest
problems is that applications tend to operate in absolutes (a different
absolute depending on the application and the situation), whereas the
scheduler has to make decisions based on counting quantums.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: A little story of failed raid5 (3ware 8000 series)

2007-08-24 Thread Matthew Dillon
   A friend of mine once told me that the only worthwhile RAID systems are
   the ones that email you a detailed message when something goes south.

-Matt


Re: default dns config change causing major poolpah

2007-08-01 Thread Matthew Dillon
The vast majority of machine installations just slave their dns off
of another machine, and because of that I do not think it is particularly
odious to require some level of skill for those who actually want to set
up their own server.

To that end what I do on DragonFly is simply supply a README file in
/etc/namedb along with a few helper scripts describing how to do it in
a fairly painless manner.  If a user cannot understand the README then
he has no business setting up a DNS server anyhow.  Distributions need to
be fairly sensitive to doing anything that might accidentally (through lack
of understanding) cause an overload of critical internet resources.

http://www.dragonflybsd.org/cvsweb/src/etc/namedb/

I generally recommend using our 'getroot' script to download an actual
root.zone file instead of using a hints file (and I guess AXFR is supposed
to replace both concepts).  It has always seemed to me that actually
downloading a physical root zone file once a week is the most reliable
solution.

I've never trusted using a hints file... not for at least a decade,
and I probably wouldn't trust AXFR for the same reason.  Probably my
mistrust is due to the massive problems I had using a hints file long
ago and I'm sure it works better these days, but I've never found any
reason to switch back from an actual root.zone.

I've enclosed the getroot script we ship below.  In any case, it seems
to me that there is no good reason to try to automate dns services as
a distribution default in the manner being described.  Just my
two-cents.

-Matt

#!/bin/tcsh -f
#
# If you are running named and using root.zone as a master, the root.zone
# file should be updated periodicly from ftp.rs.internic.net.
#
# $DragonFly: src/etc/namedb/getroot,v 1.2 2005/02/24 21:58:20 dillon Exp $

cd /etc/namedb
umask 027

set hostname = 'ftp.rs.internic.net'
set remfile = domain/root.zone.gz
set locfile = root.zone.gz
set path = ( /bin /usr/bin /sbin /usr/sbin )

fetch ftp://${hostname}:/${remfile}
if ( $status != 0) then
rm -f ${locfile}
echo Download failed
else
gunzip < ${locfile} > root.zone.new
if ( $status == 0 ) then
rm -f ${locfile}
if ( -f root.zone ) then
mv -f root.zone root.zone.bak
endif
chmod 644 root.zone.new
mv -f root.zone.new root.zone
echo Download succeeded, restarting named
rndc reload
sleep 1
rndc status
else
echo Download failed: gunzip returned an error
rm -f ${locfile}
endif
endif



Re: removing external usb hdd without unmounting causes reboot?

2007-07-31 Thread Matthew Dillon

: By the way, the problem apparently has been solved in
: DragonFly BSD (i.e. DF BSD does not panic when a mounted
: FS is physically removed).  Maybe it is worth to have a

We didn't do much here.  Just started pulling devices, looking at the
crash dumps, and fixing things.

Basically it was just a collection of minor bugs... things like certain
error paths in UFS (which only occur on an I/O error) had bugs, or
caused corruption instead of properly handling the error, and
various bits and pieces of the USB I/O path would get ripped out on
the device pull while still referenced by other bits of the USB I/O
path.

You will also have to look at the way vfs flushing handles errors
in order to allow a filesystem to be force-unmounted after the device
has been pulled.  Basically you have to make umount -f work and you have
to make sure it properly dereferences the underlying device and properly
destroys the (now unwritable) dirty buffers.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: calcru: runtime went backwards, RELENG_6, SMP

2007-06-12 Thread Matthew Dillon
:s,/kernel,/boot/kernel/kernel, ;-)
:
:well, strange enough result for me:
:
:(kgdb) print cpu_ticks
:$1 = (cpu_tick_f *) 0x8036cef0 rdtsc
:
:Does this mean that kernel uses tsc? sysctl reports
:
:kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-100)
:kern.timecounter.hardware: ACPI-fast

It means the kernel is using the TSC for calcru.  It's using ACPI-fast
for normal timekeeping.

In anycase, that's the problem right there, or at least one problem.
The TSC cannot safely be used for calcru or much of anything else on
a SMP system because the TSCs aren't synchronized between cpu's and
because their frequencies aren't locked, so they will drift relative
to each other as well. 

If you want to run another test, try disabling the use of the TSC for
calcru.  There is no boot variable I can see to do it so go into
/usr/src/sys/i386/i386/tsc.c and comment out the call to
set_cputicker() in Line 107 and line 187.  Then see if that helps.
If you are doing an amd64 build comment it out in amd64/amd64/tsc.c
line 98 and line 163.
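
i.e. something like the following (a sketch from memory; the exact
argument list and line numbers depend on your revision):

    /* In i386/i386/tsc.c (and amd64/amd64/tsc.c), disable the
     * registration of the TSC as the cpu ticker: */
    #if 0
            set_cputicker(rdtsc, tsc_freq, 1);
    #endif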

-Matt



Re: calcru: runtime went backwards, RELENG_6, SMP

2007-06-11 Thread Matthew Dillon

:==
: cs3661.rinet.ru 192.38.7.240 2 u  354 1024  3775.305  -66314. 4321.47
: ns.rinet.ru 130.207.244.240  2 u  365 1024  3776.913  -66316. 4305.33
: whale.rinet.ru  195.2.64.5   2 u  358 1024  3777.939  -66308. 4304.90
:
:Any directions to debug this?
:
:Sincerely,
:D.Marck [DM5020, MCK-RIPE, DM3-RIPN]

Since you are running on HEAD now, could you also kgdb the live kernel
and print cpu_ticks?  I believe the sequence is (someone correct me if
I am wrong):

kgdb /kernel /dev/mem
print cpu_ticks

As for further tests... try building a non-SMP kernel (i.e. one that
only recognizes one cpu) and see if the problem occurs there.  That
will determine whether there is a basic problem with time keeping or
whether it is an issue with SMP.

I'm afraid there isn't much more I can do to help, other than to make
suggestions on tests that you can run that will hopefully ring a bell
with another developer.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: calcru: runtime went backwards, RELENG_6, SMP

2007-06-09 Thread Matthew Dillon
:Hmm, i'm not sure I understand you right: what do you mean by 'kgdb live 
:kernel'? I send break over serial console, and in ddb got
:
:db print cpu_ticks
:Symbol not found
:
:Sincerely,
:D.Marck [DM5020, MCK-RIPE, DM3-RIPN]

I think it works the same on FreeBSD, so it would be something like:

kgdb /kernel /dev/mem

^^^ NOTE! Dangerous! ^^^

But I looked at the cvs logs and the variable didn't exist in FreeBSD-6,
so it wouldn't have helped anyway.

It looks like it is using binuptime() in 6.x, and it also looks like
the tick calculations, e.g. rux_uticks, is based on the stat clock
interrupt, whereas the runtime calculation is using binuptime.  There
is no way those two could possibly be synchronized.  No chance
whatsoever.  Your only solution may be to upgrade to FreeBSD-7 which
uses an entirely different mechanism for the calculation (though one
that also seems flawed in its own way).

Alternatively you could just remove the error message from the kernel
entirely and not worry about it.  It's a printf around line 774
in /usr/src/sys/kern/kern_resource.c (in FreeBSD-6.x).

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: calcru: runtime went backwards, RELENG_6, SMP

2007-06-09 Thread Matthew Dillon
:Well, I can of course shut the kernel up, but kernel time stability is still
:my concern.  I run ntpd there and while sometimes it seems stable (well, sorta:
:drift is within several seconds...) there are cases of half-a-minute time
:steps.
:
:Sincerely,
:D.Marck [DM5020, MCK-RIPE, DM3-RIPN]

I think the only hope you have of getting the issue addressed is to
run FreeBSD current.  If you can reproduce the time slips under current
the developers should be able to track the problem down and fix it.  The
code is so different between those two releases that they are going to
have a hard time working the problem in FreeBSD-6.

If you don't want to do that, try forcing the timer to use the 8254
and see if that helps.  You may also have to reduce the system tick
to ~100-200 hz.
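
e.g. (the i8254 name comes from the kern.timecounter.choice list
shown earlier; the hz value is an example):

    # switch the timecounter at runtime
    sysctl kern.timecounter.hardware=i8254
    # and lower the tick rate via /boot/loader.conf
    kern.hz=100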

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: calcru: runtime went backwards, RELENG_6, SMP

2007-06-06 Thread Matthew Dillon
:IV  Upd: on GENERIC/amd64 kernel I got the same errors.
:IV 
:IV Do you perhaps run with TSC timecounter? (that's the only cause I've notice
:IV that can generate this message).
:
:Nope:
:
:[EMAIL PROTECTED]:~ sysctl kern.timecounter
:kern.timecounter.tick: 1
:kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-100)
:kern.timecounter.hardware: ACPI-fast
:...

kgdb your live kernel and 'print cpu_ticks'.  See what the cpu ticker
is actually pointing at, because it might not be the time counter.
It could still be TSC.

The TSC isn't synchronized between the cores on a SMP box, not even
on multi-core parts.  It can't be used to calculate delta times
for any thread that has the possibility of migrating between cpu's.
Not only will the absolute offset be off between cpus, but the frequency
will also be slightly different (at least on SMP multi-core parts),
so you get frequency drift too.

There is also possibly an issue with tc_cpu_ticks(), which seems to
be using a static 64 bit variable to handle rollover instead of
a per-cpu variable.  I don't see how that could possibly be MP safe,
especially if the timecount is not synchronized between cpus and
causes multiple rollover events.

In fact, I can *barely* use the TSC on DragonFly for KTR logging, and
even then I have to have some kernel threads sitting there doing nothing
but figuring out the drift between the cpus so it can correct the
TSC values when it logs information... and even with all of that I
can't get them synchronized any closer than around 500ns from each
other.

I'd recommend that FreeBSD do what we did years ago with calcru ... stop
trying to calculate the time down to the nanosecond and just do it
statistically.  It works just fine and takes the whole mess out of
the critical path.

-Matt



Re: Does a pipe take a socket ... ?

2007-05-15 Thread Matthew Dillon
:Marc G. Fournier wrote:
:  For those that remember the other day, I had that swzone issue, where
:  I ran out of swap space?  I just about hit it again today, swap was up
:  to 99% used ... I was able to get a ps listing in, and there were a
:  whack of find processes running ...
:  
:  Now, I think I know which VPS they were running in, so that isn't a
:  problem ... and I suspect that the find was just part of a longer pipe
:  ... I'm just curious if those pipes would happen to use up any of those
:  sockets that are 'evaporating', or is this totally unrelated to sockets?
:
:In FreeBSD, pipe() is implemented with the socketpair(2)
:system call.  Every pipe uses two sockets (one for each
:endpoint).
:
:Best regards
:   Oliver
:
:-- 
:Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.

Nuh uh.  pipe() is a direct implementation... no sockets anywhere.

Using socketpair() will eat sockets up, but using pipe() will not.

-Matt


Re: clock problem

2007-05-11 Thread Matthew Dillon

:One of our customers has 6 GPS-locked NTP servers.  Only problem is
:that two of them are reporting a time that is exactly one second
:different to the other four.  You shouldn't rely solely on your
:GPS or DCF receiver - use it as the primary source but have some
:secondary sources for sanity checks.  (From experience, I can state
:that ntpd does not behave well when presented with two stratum 1
:servers that differ by 1 second).
:
:-- 
:Peter Jeremy

Ntp will also become really unhappy when chunky time slips occur
or if the skew rate is more than a few hundred ppm.  Ntp will also blow
up if it loses the network link for a long period of time.  It will just
give up and stop making corrections entirely, even after the link is
restored.  This is particularly true over a dialup (I ran it over
dialup for over a year in 1997, so I can tell you how badly it works).

A slow time slip over a day could still be chunky, which would imply
lost interrupts.  Determining whether the problem is due to an 8254
rollover or lost hardclock interrupts is easy... just set 'hz' to
something really high, like 20000, and see if your time goes crazy.
If it does, then you have your culprit.

I don't know if those bugs are still present in FreeBSD, but I do
remember that I had to redo all the timekeeping in DragonFly because
lost interrupts from high 'hz' settings were causing timekeeping to
go nuts.  That turned out to mainly be due to the same 8254 timer being
used to generate the hardclock interrupt AND handle time keeping.
i.e. at high hz settings one was not getting the full 1/18 second
benefit from the timer.  You just can't do that... it doesn't work.
It is almost 100% guaranteed to result in a bad time base.

It is easy to test... just set your kern.hz in the boot env, reboot,
and see if things blow up or not.  Time keeping should be stable
regardless of what hz is set to (proviso: never set hz to less than 100).
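
For example, at the loader prompt (the value is just an illustration of
'really high'):

    OK set kern.hz=5000
    OK boot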

Unfortunately, all the timebases in the system have their own quirks.
Blame the hardware manufacturers.  The 8254 timer 0 is actually the
MOST consistent of the lot, with the ACPI timer coming a close second.

TSC             Haha.  Good luck.  Nice wide timer, easy to read,
                but any power savings mode, including the failsafe
                modes that Intel has when a cpu overheats, will
                probably blow it up.  Because of that it is not
                really a good idea to use it as a timebase.  I shake
                my fist at Intel! $#%$#%$#%

ACPI timer      Despite the hardware bugs this almost always works
                as a timebase, but sometimes the frequency changes
                when the cpu goes into power savings mode or EST,
                and sometimes the frequency is something other
                than what it is supposed to be.

8254 timer 0    Almost always works as a timebase, but only if it is
                not also used to generate high-speed interrupts
                (because interrupts are lost easily).  Set it to
                a full cycle (1/18 second) and you will be fine.
                Set it to anything else and you will lose interrupts.

                The BIOS will sometimes mess with timer 0, but not
                as often as it messes with timer 2.

8254 timer 1    Sometimes works as a time base, but can lock older
                machines up.  Can even lock up newer machines.
                Why?  Because hardware manufacturers are idiots.

8254 timer 2    Often can be used as a time base, but video BIOS
                calls often try to use it too.  $#%@ BIOS makers!
                Still, this is better than losing interrupts when
                timer 0 is set to high speed, so DragonFly uses
                timer 2 for its timebase as a default until the
                ACPI timer becomes available, with a boot option
                to use timer 1 instead.  Using timer 2 as a time
                base means you don't get motherboard speaker sound
                (the old beep beep BEEP!).  Do I care?  No.

LAPIC timer     Dunno.  Probably best to use it as a high speed
                clock interrupt, which would free 8254 timer 0 for
                use as a time base.

RTC interrupt   Basically unusable.  Stable, but doesn't have
                sufficient resolution to be helpful and takes
                forever to read.

-Matt
Matthew Dillon 

Re: clock problem

2007-05-11 Thread Matthew Dillon
Another idea to help track down timebase problems.  Port dntpd to
FreeBSD.  You need like three sysctls (because the ntp API and the
original sysctl API are both insufficient).  Alternatively you could
probably hack dntpd to run in debug mode without having to implement
any new sysctls, as long as you make sure to clean out any active
timebase adjustments in the kernel before you run it.

Here's some sample output:

http://apollo.backplane.com/DFlyMisc/dntpd.sample01.txt

Dntpd in debug mode will print out the results from two staggered
continuously running linear regressions (resets after 30 samples,
staggered by 15 samples).

For anyone who understands how linear regressions work, finding kernel
timekeeping bugs is really easy with this sort of output.  You get the
slope, y-intercept, correlation, and standard deviation, and then you
get calculated frequency drift and time offset based on those numbers.

The correlation is accurate after around 10 samples.  Note that
frequency drift calculations require longer intervals to get better
results.  The forced 30 second interval set in the sample output is
way too short, hence the errors (it has to be in the 90th percentile to
even have a chance of producing a reasonable PPM calculation).  But
also remember we are talking parts per million here.

If you throw away iteration numbers < 15 or so you will get very nice
output and kernel bugs will show up in fairly short order.  Kernel
bugs will show up as non-trivial y-intercept calculations over
multiple samples, large jumps in the offset, inability to get a good
correlation (proviso: the sample interval has to be at least 120 seconds,
not the 30 in my example), and so on and so forth.
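
For anyone who wants to reproduce that analysis, here is a minimal
sketch of the kind of least-squares fit involved (my own illustration,
not dntpd's actual code): fit offset samples against time, read the
frequency error off the slope, the initial offset off the intercept,
and the trustworthiness off the correlation.

    #include <math.h>

    struct fit {
	double slope;		/* frequency error, sec/sec (x 1e6 = ppm) */
	double intercept;	/* offset estimate at t = 0 */
	double corr;		/* correlation coefficient, -1..1 */
    };

    /* Least-squares fit of clock offset (sec) versus sample time (sec). */
    static void
    regress(const double *t, const double *off, int n, struct fit *f)
    {
	double st = 0, so = 0, stt = 0, soo = 0, sto = 0;
	int i;

	for (i = 0; i < n; ++i) {
		st += t[i]; so += off[i];
		stt += t[i] * t[i]; soo += off[i] * off[i];
		sto += t[i] * off[i];
	}
	double cov = sto - st * so / n;		/* n >= 2 assumed */
	double vart = stt - st * st / n;
	double varo = soo - so * so / n;

	f->slope = cov / vart;
	f->intercept = (so - f->slope * st) / n;
	f->corr = cov / sqrt(vart * varo);
    }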

Also be sure to use a locked ntp source, otherwise running corrections on
the source will show up as problems in the debug output.  pool.ntp.org
is usually good enough.  It's fun checking various time sources with
an idle box with a good timebase. hhahahhaha. OMG.

-Matt
Matthew Dillon 


RE: Creating one's own installer/mfsroot

2007-05-09 Thread Matthew Dillon
: You could also look at the INSTALL guides for (early versions of?)
: Dragonfly, it taught me how to install a BSD system from scratch,
: using only what's in the base of the liveCD :)
:...
:
:We have set up a boot CD (or pxeboot/nfs environment) where we can run a
:Ruby script that will take directives from a configuration file, configure
:the disks, slices and partitions, align the partitions to start on a block
:...

I came up with a neat little script-driven remote configurator called
rconfig (/usr/src/sbin/rconfig in the DragonFly codebase) as an
alternative to the standard features in our installer.  I recommend
checking it out.

http://www.dragonflybsd.org/cvsweb/src/sbin/rconfig/

rconfig is really easy to use.  Basically it's just a client/server pair
with socket broadcast capabilities.  All it does is negotiate a
shell script download from the server to the client (server on the
same subnet), then runs the script on the client.  That's it.

I wanted to be able to boot a CD, login as root, dhclient the network
up, and then just go 'rconfig -a' and have my script do the rest.  It
takes a bit of time to write the shell script to do a full install
from fdisk to completion, but if you have a fully working CD based
environment (all the binaries in /, /usr, a writable /tmp, /etc, and
so forth)... then shell scripts are just as easy to write as they are
on fully installed machines.
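
A hedged example of the whole client-side workflow (the interface name
is just an example; -a is the broadcast mode described above):

    # dhclient em0          (bring the network up from the live CD)
    # rconfig -a            (find the server on the subnet, run its script)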

I use rconfig to do fresh installs of my test boxes from CD, with all the
customization, my ssh keys, fstab entries, NFS mounts, etc that I need
to be able to ssh into the box and start using it immediately.

NFS booting is 'ok', but requires a lot of infrastructure and gets
out of date easily.  Often you also have to mess with the BIOS settings,
which is very annoying because you have to change them back after you
finish the install.  I used NFS booting for a while, but just couldn't
depend on it always being operational.  With rconfig I just leave the
rconfig server running on one of my boxes and the worst I have to do
is tweak my script a bit.  And adding smarts to the script is easy
whereas you just can't add much smarts to an NFS boot without a lot of
messing around with the RC sequence.

In any case, check it out.  My assumption is that rconfig would compile
nearly without modification on a FreeBSD box.

-Matt
Matthew Dillon 


Re: swap zone exhausted, increase kern.maxswzone

2007-05-05 Thread Matthew Dillon
Basically maxswzone is the amount of KVM the kernel is willing to
use to store 'struct swblock' structures.

These are the little structures that are stuck onto VM objects and
specify which pages in the VM object(s) correspond to which pages
of swap, for any swapped out data that no longer has a vm_page_t.

It should be almost impossible to run out.  Each structure can handle
16 contiguous swap block assignments in the VM object.  Pages in
VM objects tend to get swapped out in large linear swaths and the
dynamic nature of paging tends to enforce this even if things are
a bit chunky initially.  So running out should just never happen.
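
From memory, the structure involved looks roughly like this (FreeBSD's
vm/swap_pager.c; names approximate), which is where the 16-slot figure
comes from:

    #define SWAP_META_PAGES	16	/* swap block slots per swblock */

    struct swblock {
	struct swblock	*swb_hnext;	/* hash chain */
	vm_object_t	 swb_object;	/* VM object this run belongs to */
	vm_pindex_t	 swb_index;	/* base page index in the object */
	int		 swb_count;	/* valid entries in swb_pages[] */
	daddr_t		 swb_pages[SWAP_META_PAGES]; /* swap block numbers */
    };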

The only thing I can think of is if a machine has a tiny, tiny amount
of ram and a huge amount of swap.  e.g. like 64M of ram and 2G of swap,
and actually tries to use it all.  The default KVM reservation is based
on physical memory, I think.  Otherwise, it just shouldn't happen.

I see that the code in FreeBSD is using UMA now, I would double check
that it is actually calculating the proper amount of space allowed to be
allocated.  Maybe you have a leak somewhere.

Note that swap interactions have to operate in low-memory situations.
Make sure UMA isn't gonna have a meltdown if the system is running low
on freeable VM pages.

-Matt



Re: swap zone exhausted, increase kern.maxswzone

2007-05-05 Thread Matthew Dillon

:If you do have 8gigs of swap, then you do need to increase the parameter..
:The default is 7.7gigs of supported swap...  (assuming that struct swblock
:hasn't changed size...  The maxswblock only limits it... If swap is more
:than 8x memory, then changing kern.maxswzone will not fix it and will
:require a code change...
:
:-- 
:  John-Mark Gurney Voice: +1 415 225 5579

The swblock structures only apply to actively swapped out data.  Mark,
how much data is actually swapped out (pstat -s) at the time the
problem is reported?

If you can dump UMA memory statistics that would be beneficial as well.
I just find it hard to imagine that any system would actually be using
that much swap, but hey! :-)

-Matt
Matthew Dillon 


Re: swap zone exhausted, increase kern.maxswzone

2007-05-05 Thread Matthew Dillon
:That's why I think that the socket issue and this one are co-related ... with 
:everything started up (93 jails), my swap usage right now is:
:
:mars# pstat -s
:Device  1K-blocks UsedAvail Capacity
:/dev/da0s1b   8388608   20  8388588 0%
:
:Its only been up 2.5 hours so far, but still, everything is started up ...
:
:- 
:Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)

The "swap zone exhausted, increase kern.maxswzone" message only prints
if uma_zone_exhausted() returns TRUE.  uma_zone_exhausted() appears to
be based on a UMA flag which is only set if the pages for the zone
exceeds some maximum setting.

Insofar as I can tell, vmstat -z on FreeBSD will dump the UMA zones,
so try using that when the problem occurs along with pstat -s.  It
sounds like there is a leak somewhere (but I don't see how anything
in any other UMA zones could cause the SWAPMETA zone to fill up).  Or
the maximum setting is too low, or something is getting lost somewhere.

We'll have a better idea as to what is going on when you get the message
again.  You might even want to do a once-a-10-minutes cron job to
append pstat -s, vmstat -m, and vmstat -z to a file.
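
Something like this in /etc/crontab, hedged (path and interval are
arbitrary):

    # append swap and zone statistics every 10 minutes
    */10 * * * *  root  (date; pstat -s; vmstat -m; vmstat -z) >> /var/log/swapstat.log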

-Matt
Matthew Dillon 


Re: Socket leak (Was: Re: What triggers No Buffer Space) Available?

2007-05-03 Thread Matthew Dillon
:I'm trying to probe this as well as I can, but network stacks and sockets have 
:never been my strong suit ...
:
:Robert had mentioned in one of his emails that "Sockets can also exist
:without any referencing process (if the application closes, but there is
:still data draining on an open socket)."
:
:Now, that makes sense to me, I can understand that ... but, how would
:that look as far as netstat -nA shows?  Or, would it?  For example, I have:
:
:...

Netstat should show any sockets, whether they are attached to processes
or not.  Usually you can match up the address from netstat -nA with
the addresses from sockets shown by fstat to figure out what processes
the sockets are attached to.

There are three situations that you have to watch out for:

(1) The socket was close()'d and is still draining.  The socket
will timeout and terminate within ~1-5 minutes.  It will not
be referenced to a descriptor or process.

(2) The socket descriptor itself has been sent over a unix domain socket
from one process to another and is currently in transit.  The 
file pointer representing the descriptor is what is actually in
transit, and will not be referenced by any processes while it is
in transit.

There is a garbage collector that figures out unreferencable loops.
I think it's called unp_gc or something like that.

(3) The socket is not closed, but is idle (like having a remote shell
open and never typing in it).  Service processes can get stuck
waiting for data on such sockets.  The socket WILL be referenced
by some process.

These are controlled by net.inet.tcp.keep* and
net.inet.tcp.always_keepalive.  I almost universally turn on
net.inet.tcp.always_keepalive to ensure that dead idle connections
get cleaned out.

Note that keepalive only applies to idle connections.  A socket
that has been closed and needs to drain (either data or the FIN
state) will timeout and clean up itself whether keepalive is
turned on or off.

netstat -nA will give you the status of all your sockets.  You can
observe the state of any TCP sockets.

Unix domain sockets have no state and closure is governed simply by
them being dereferenced, just like a pipe.  In this case there are really
only two situations:  (1) One end of the unix domain socket is still
referenced by a process or (2) The socket has been sent over another
unix domain socket and is 'in transit'.  The socket will remain intact
until it is either no longer in transit (read out from the other unix
domain socket), or the garbage collector determines that the socket the
descriptor is transiting over is not externally referenceable, and
will destroy it and any in-transit sockets contained within.

Any sockets that don't fall into these categories are in trouble...
either a timer has failed somewhere or (if unix domain) the garbage
collector has failed to detect that it is in an unreferenceable loop.
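
For reference, the 'in transit' situation in (2) above is created with
SCM_RIGHTS control messages over a unix domain socket; a minimal sketch
of the sending side (standard sockets API, nothing FreeBSD-specific):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /*
     * Send file descriptor 'fd' across the unix domain socket 'sock'.
     * While the message sits in the receiver's socket buffer the
     * descriptor is "in transit", referenced by no process -- exactly
     * the state the unp_gc garbage collector has to reason about.
     */
    static int
    send_fd(int sock, int fd)
    {
	struct msghdr msg;
	struct iovec iov;
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} cmsgbuf;
	char c = 0;

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &c;		/* must send at least one byte */
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cmsgbuf.buf;
	msg.msg_controllen = CMSG_SPACE(sizeof(int));

	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) < 0 ? -1 : 0);
    }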

-

One thing you can do is drop into single user mode... kill all the 
processes on the system, and see if the sockets are recovered.  That
will give you a good idea as to whether it is a real leak or whether
some process is directly or indirectly (by not draining a unix domain
socket on which other sockets are being transferred) holding onto the
socket.

-Matt



Re: Socket leak (Was: Re: What triggers No Buffer Space) Available?

2007-05-03 Thread Matthew Dillon

:*groan*  why couldn't this be happening on a server that I have better
:remote access to? :(
:
:But, based on your explanation(s) above ... if I kill off all of the
:jail(s) on the machine, so that there are minimal processes running,
:shouldn't I see a significant drop in the number of sockets in use as
:well?  or is there something special about single user mode vs just
:killing off all 'extra processes'?
:
:- 
:Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)

Yes, you can.  Nothing special about single user... just kill all
the processes that might be using sockets.  Killing the jails is a good
start.

If you are running a lot of jails then I would strongly suspect that
there is an issue with file descriptor passing over unix domain sockets.
In particular, web servers, databases, and java or other applets could
be the culprit.

Other possibilities... you could just be running out of file descriptors
in the file descriptor table.

use vmstat -m and vmstat -z too... find out what allocates the socket
memory and see what it reports.  Check your mbuf allocation statistics
too (netstat -m).  Damn, I wish that information were collected
on a per-jail basis but I don't think it is.  Look at all the memory
statistics and check to see if anything is growing unbounded over a
long period of time (versus just growing into a cache balance).  Create
a cron job that dumps memory statistics once a minute to a file then
break each report with a clear-screen sequence and cat it in a really
big xterm window.

-Matt
Matthew Dillon 


Re: Xen Dom0, are we making progress?

2007-03-29 Thread Matthew Dillon
Virtual kernels won't be winning any awards, but they sure can be
convenient.  Most of my kernel development is now done in virtual
kernels.  It also makes kernel development more attainable to people
who are not traditionally kernel coders.  The synergy is very good.

--

In any case, as usual I rattle on.  If FreeBSD is interested I recommend
simply looking at the cool features I added to DragonFly's kernel to
make virtual kernels possible.  It's really just three major items:
Signal mailboxes, a new MAP_VPAGETABLE for mmap, and the new vmspace_*()
system calls for managing VM spaces.  Once those features were in place
it didn't take long for me to create a 'vkernel' platform that linked
against libc and used the new system calls.

-Matt
Matthew Dillon 


Re: Xen Dom0, are we making progress?

2007-03-29 Thread Matthew Dillon

:Virtual kernels are a cool idea, but I (and I believe practically anyone
:using FreeBSD for non-development work) would much rather see a Xen-like
:functionality (to be precise: ability to run foreign kernels and
:Windows; qemu is too slow) than just a variation of the native kernel.

There is certainly a functionality there that people will find useful,
but you also have to realize that Xen involves two or more distinct
operating systems which will multiply the number of bugs you have to
deal with and create major compatibility issues with the underlying
hardware, making it less than reliable.

Really only the disk and network I/O can be made reliably compatible
in a Xen installation.  Making sound cards, video capture cards, 
encryption cards, graphics engines, and many other hardware features
work well with the guest operating system will not only be difficult,
but it will also be virtually unmaintainable in that environment over
the long term.  Good luck getting anything more then basic application
functionality out of it.

For example, you would have no problem running pure network applications
such as web and mail servers on the guest operating system, but the
moment you delve outside of that box and into sound and high quality 
(or high performance) video, things won't be so rosy.

I don't see much of an advantage in having multi-OS hardware 
virtualization for any serious deployment.  It would be interesting and
useful on a personal desktop, at least within the scope of the limited
hardware compatibility, but at the same time it will also lock you into
software and OS combinations that aren't likely to extend into the
future, and which will be a complete and utter nightmare to maintain.
Any failure at all could lead to a completely unrecoverable system.

-Matt
Matthew Dillon 


Re: ntpd flipping between PLL and FLL mode

2006-12-19 Thread Matthew Dillon

:How would decreasing the polling time fix this?  I do not understand
:the semantics/behaviour of NTP very well.
:
:Taken from the manpage:
:
:  minpoll minpoll, maxpoll maxpoll
:  These options specify the minimum and maximum poll intervals for
:  NTP messages, in seconds to the power of two.  The maximum poll
:  interval defaults to 10 (1,024 s), but can be increased by the
:  maxpoll option to an upper limit of 17 (36.4 h).  The minimum
:  poll interval defaults to 6 (64 s), but can be decreased by the
:  minpoll option to a lower limit of 4 (16 s).

Though I can't speak to the algorithm ntpd uses, if a correlation
is used along with a standard deviation to calculate offset and frequency
errors, then decreasing the polling interval makes it virtually
impossible to get an accurate frequency lock.  Frequency locks require
long polling intervals.

So you would see no flips (or fewer flips), but you wouldn't have
a very accurate time base either.  You know you have a bad frequency
correction if you see significant offset corrections occurring every day.
The whole concept of 'flips' is broken anyhow; it just means the
application is not using the correct mathematical algorithm.

NTPD never worked very well for me in all the years I've used it.  Not
ever.  OpenNTPD also uses an awful algorithm.  If you need an NTP
client-only app you might want to consider porting our DNTPD.  It is a
client-only app (no server component) which uses two staggered
correlations and two staggered standard deviations for each time
source and corrects the time based on a mathematically calculated
accuracy rather than after some pre-contrived time delay or interval.

Some minor messing around might be needed to port it since we use a
slightly more advanced sysctl scheme to control offset and frequency
correction.  It also has a tendency to make errors in OS time keeping
obvious.  In particular, any bugs in how the OS handles offset and
frequency corrections will become very obvious.  We found a
microsecond-vs-nanosecond bug in DragonFly with it.

If you have a good frequency lock you should not see offset corrections
occurring very often.  I've included examples of what you should be able
to achieve below from a few of our machines.  In the examples below I
got a reasonable frequency lock within an hour and then did not have to
correct for it after that (which means that the error calculation for
the continuously running frequency drift calculation was not small
enough to make further frequency corrections).  These are using the
pool NTP sources on the internet.  With a LAN source you would probably
see more frequency corrections.

Correlations are only useful with a limited number of samples... once
you get beyond 30 samples or so the algorithm tends to plateau, which
is why you need to have at least two running correlations with staggered
start times.  I have considered adding two additional staggered
correlations to get more accurate frequency locks (e.g. 30 two-hour
samples in addition to 30 thirty-minute samples) but PC time bases just
aren't accurate enough to justify it (I know of no PC motherboards which
use temperature-corrected crystal time bases.  They are all uncorrected
time bases.  It's really annoying).  Ah well.

-Matt

Dec  3 10:46:57 crater dntpd[605]: dntpd version 1.0 started
Dec  3 10:47:13 crater dntpd[605]: issuing offset adjustment: 0.706663
Dec  3 11:29:32 crater dntpd[605]: issuing offset adjustment: 0.015905
Dec  3 11:39:57 crater dntpd[605]: issuing frequency adjustment:  8.656ppm
Dec  3 11:50:25 crater dntpd[605]: issuing offset adjustment: 0.011579
Dec  4 09:21:18 crater dntpd[605]: issuing offset adjustment: -0.007325
Dec  5 20:26:08 crater dntpd[605]: issuing offset adjustment: 0.007002
Dec  6 09:20:32 crater dntpd[605]: issuing offset adjustment: -0.008491
Dec  6 09:40:11 crater dntpd[605]: issuing offset adjustment: 0.004089
Dec  6 22:23:50 crater dntpd[605]: issuing offset adjustment: 0.006602
Dec  6 22:43:16 crater dntpd[605]: issuing offset adjustment: -0.002391
Dec  8 13:29:11 crater dntpd[605]: issuing offset adjustment: -0.005005
Dec 11 23:37:00 crater dntpd[605]: issuing offset adjustment: 0.004607
Dec 17 23:11:26 crater dntpd[605]: issuing offset adjustment: -0.005559
Dec 18 23:05:12 crater dntpd[605]: issuing offset adjustment: 0.008101

Dec  3 10:47:13 leaf dntpd[593]: dntpd version 1.0 started
Dec  3 10:47:29 leaf dntpd[593]: issuing offset adjustment: 0.027401
Dec  3 11:08:45 leaf dntpd[593]: issuing frequency adjustment: -12.384ppm
Dec  3 13:14:49 leaf dntpd[593]: issuing offset adjustment: -0.012258
Dec  3 20:14:44 leaf dntpd[593]: issuing offset adjustment: -0.010502
Dec 10 04:27:05 leaf dntpd[593]: issuing offset adjustment: -0.008231
Dec 16 

Re: Maximum Swapsize

2006-04-11 Thread Matthew Dillon
From 'man tuning' (I think I wrote this, a long time ago):

 You should typically size your swap space to approximately 2x main mem-
 ory.  If you do not have a lot of RAM, though, you will generally want a
 lot more swap.  It is not recommended that you configure any less than
 256M of swap on a system and you should keep in mind future memory expan-
 sion when sizing the swap partition.  The kernel's VM paging algorithms
 are tuned to perform best when there is at least 2x swap versus main mem-
 ory.  Configuring too little swap can lead to inefficiencies in the VM
 page scanning code as well as create issues later on if you add more mem-
 ory to your machine.  Finally, on larger systems with multiple SCSI disks
 (or multiple IDE disks operating on different controllers), we strongly
 recommend that you configure swap on each drive (up to four drives).  The
 swap partitions on the drives should be approximately the same size.  The
 kernel can handle arbitrary sizes but internal data structures scale to 4
 times the largest swap partition.  Keeping the swap partitions near the
 same size will allow the kernel to optimally stripe swap space across the
 N disks.  Do not worry about overdoing it a little, swap space is the
 saving grace of UNIX and even if you do not normally use much swap, it
 can give you more time to recover from a runaway program before being
 forced to reboot.
--

The last sentence is probably the most important.  The primary reason why 
you want to configure a fairly large amount of swap has less to do with
performance and more to do with giving the system admin a long runway
to have the time to deal with unexpected situations before the machine
blows itself to bits.

The swap subsystem has the following limitation:

	/*
	 * If we go beyond this, we get overflows in the radix
	 * tree bitmap code.
	 */
	if (nblks > 0x40000000 / BLIST_META_RADIX / nswdev) {
		printf("exceeded maximum of %d blocks per swap unit\n",
		    0x40000000 / BLIST_META_RADIX / nswdev);
		VOP_CLOSE(vp, FREAD | FWRITE, td);
		return (ENXIO);
	}

By default, BLIST_META_RADIX is 16 and nswdev is 4, so the maximum
number of blocks *PER* swap device is 16 million.  If PAGE_SIZE is 4K,
the limitation is 64 GB per swap device and up to 4 swap devices
(256 GB total swap).
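
A quick sanity check of that arithmetic (plain C, constants from the
snippet above, assuming 4K pages):

    #include <stdio.h>

    int
    main(void)
    {
	long long nblks = 0x40000000LL / 16 / 4;  /* BLIST_META_RADIX, nswdev */

	printf("%lld blocks/device, %lld GB/device, %lld GB total\n",
	    nblks, nblks * 4096 / (1LL << 30), 4 * nblks * 4096 / (1LL << 30));
	/* prints: 16777216 blocks/device, 64 GB/device, 256 GB total */
	return (0);
    }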

The kernel has to allocate memory to track the swap space.  This memory
is allocated and managed by kern/subr_blist.c (assuming you haven't
changed things since I wrote it).  This is basically implemented as a
flattened radix tree using a fixed radix of 16.  The memory overhead is
fixed (based on the amount of swap configured) and comes to
approximately 2 bits per VM page.  Performance is approximately O(log N).

Additionally, once pages are actually swapped out the VM object must
record the swap index for each page.  This costs around 4 bytes per
swapped-out page and is probably the greatest limiting factor in the
amount of swap you can actually use.  256GB of 100% used swap would
eat 256MB of kernel ram.

I believe that large linear chunks of reserved swap, such as used by MD,
currently still require the per-page overhead.  However, theoretically,
since the reservation model uses a radix tree, it *IS* possible to
reserve huge swaths of linear-addressed swap space with no per-page
storage requirements in the VM object.  It is even possible to do away
with the 2 bits per page that the radix tree uses if the radix tree
were allocated dynamically.  I decided against doing that because I
did not want the swap subsystem to be reliant on malloc() during 
critical low-memory paging situations.

-Matt



Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Matthew Dillon

:The results here are weird.  With 1GB RAM and a 2GB dataset, the
:timings seem to depend on the sequence of operations: reading is
:significantly faster, but only when the data was mmap'd previously
:There's one outlier that I can't easily explain.
:...
:Peter Jeremy

Really odd.  Note that if your disk can only do 25 MBytes/sec, the
calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds 
as you would expect from your numbers.

So that would imply that the 80 second numbers represent read-ahead,
and the 60 second numbers indicate that some of the data was retained
from a prior run (and not blown out by the sequential reading in the
later run).

This type of situation *IS* possible as a side effect of other
heuristics.  It is particularly possible when you combine read() with
mmap because read() uses a different heuristic then mmap() to
implement the read-ahead.  There is also code in there which depresses
the page priority of 'old' already-read pages in the sequential case.
So, for example, if you do a linear grep of 2GB you might end up with
a cache state that looks like this:

l = low priority page
m = medium priority page
h = high priority page

FILE: [---m]

Then when you rescan using mmap,

FILE: [l--m]
  [--lm]
  [-l-m]
  [l--m]
  [---l---m]
  [--lm]
  [-llHHHmm]
  [lllHHmmm]
  [---H]
  [---mmmHm]

The low priority pages don't bump out the medium priority pages
from the previous scan, so the grep winds up doing read-ahead
until it hits the large swath of pages already cached from the
previous scan, without bumping out those pages.

There is also a heuristic in the system (FreeBSD and DragonFly)
which tries to randomly retain pages.  It clearly isn't working :-)
I need to change it to randomly retain swaths of pages, the
idea being that it should take repeated runs to rebalance the VM cache
rather than allowing a single run to blow it out or allowing a 
static set of pages to be retained indefinitely, which is what your
tests seem to show is occurring.

-Matt
Matthew Dillon 


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-24 Thread Matthew Dillon

:On an amd64 system running about 6-week old -stable, both behave
:pretty much identically.  In both cases, systat reports that the disk
:is about 96% busy whilst loading the cache.  In the cache case, mmap
:is significantly faster.
:
:...
:turion% ls -l /6_i386/var/tmp/test
:-rw-r--r--  1 peter  wheel  586333684 Mar 24 19:24 /6_i386/var/tmp/test
:turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
:   21.69 real 0.16 user 0.68 sys
:[umount/remount /6_i386/var]
:turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
:   21.68 real 0.41 user 0.51 sys
:The speed gain with mmap is clearly evident when the data is cached and
:the CPU clock wound right down (99MHz ISO 2200MHz):
:...
:-- 
:Peter Jeremy

That pretty much means that the read-ahead algorithm is working.
If it weren't, the disk would not be running at near 100%.

Ok.  The next test is to NOT do umount/remount and then use a data set
that is ~2x system memory (but can still be mmap'd by grep).  Rerun
the data set multiple times using grep and grep --mmap.
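
Hedged example of that procedure (sizes and search pattern are
arbitrary; the point is a file around 2x RAM and no umount between runs):

    # dd if=/dev/random of=/var/tmp/test bs=1m count=2048   (~2GB on a 1GB box)
    # time grep dfhfhdsfhjdsfl /var/tmp/test
    # time grep --mmap dfhfhdsfhjdsfl /var/tmp/test
    (repeat both several times and compare how the timings trend)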

If the times for the mmap case blow up relative to the non-mmap case,
then the vm_page_alloc() calls and/or vm_page_count_severe() (and other
tests) in the vm_fault case are causing the read-ahead to drop out.
If this is the case the problem is not in the read-ahead path, but 
probably in the pageout code not maintaining a sufficient number of
free and cache pages.  The system would only be allocating ~60MB/s
(or whatever your disk can do), so the pageout thread ought to be able
to keep up.

If the times for the mmap case do not blow up, we are back to square
one and I would start investigating the disk driver that Mikhail is
using.

-Matt
Matthew Dillon 


Re: flushing anonymous buffers over NFS is rejected by server (more weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:This doesn't work with modes like 446 (which allow writing by everyone
:not in a particular group).

It should work just fine.  The client validated the creds as of the
original operation (such as the mmap() or the original write()).
Regardless of what happens after that, if the creds were valid when
the original operation occured, then the server should allow the write.
If the client supplies root creds for a later operation and the server
translated that to mean 'write it if its possible to write without root
creds' for exports whos roots were not mapped to root, it would actually
conform better to the reality of the state of the file at the time the
client originally performed the operation verses if the client provided
the user creds of the original write.

If the file were chmoded or chowned in between the original write
and the actual I/O operation, then it is arguable that the delayed
write I/O should succeed rather than fail.

:Doesn't that amount to significantly reducing the security of NFS?
:ISTR the original reason for nobody was that it was trivial to fake
:root so the server would map it to an account with (effectively) no
:privileges.  This change would give root on a client (file) privileges
:equal to the union of every non-root user on the server.  In
:particular, it appears that the server can't tell if a file was opened
:for read or write so a client could open a file for reading (getting a
:valid FH) and then write to it (even though it couldn't have opened the
:file for writing).
:
:-- 
:Peter Jeremy

No, it has no effect on the security of NFS.  With the exception of
'root' creds, the server trusts the client's creds, so there isn't
going to be any real difference between the client supplying user creds
versus the server translating root creds into some non-root user's creds.

NFS has never been secure.  The only reasonably secure method of
exporting a NFS filesystem is to export an entire filesystem read-only.
For any read-write export, NFS is only secure insofar as you assume
that the client can then modify any file in the exported filesystem.
The 'maproot' option is a bandaid at best, and not a very good one.

For example, exporting subdirectories of a filesystem is not secure
(and never was).  It is fairly trivial for a client to supply file
handles that are outside of the subdirectory tree that was exported.

-Matt
Matthew Dillon 


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:Actually, I can not agree here -- quite the opposite seems true. When running 
:locally (no NFS involved) my compressor with the `-1' flag (fast, least 
:effective compression), the program easily compresses faster, than it can 
:read.
:
:The Opteron CPU is about 50% idle, *and so is the disk* producing only 15Mb/s. 
:I guess, despite the noise I raised on this subject a year ago, reading via 
:mmap continues to ignore MADV_SEQUENTIAL and has no other adaptability.
:
:Unlike read, which uses buffering, mmap-reading still does not pre-fault the 
:file's pieces in efficiently :-(
:
:Although the program was written to compress files, that are _likely_ still in 
:memory, when used with regular files, it exposes the lack of mmap 
:optimization.
:
:This should be even more obvious, if you time searching for a string in a 
:large file using grep vs. 'grep --mmap'.
:
:Yours,
:
:   -mi
:
:http://aldan.algebra.com/~mi/mzip.c

Well, I don't know about FreeBSD, but both grep cases work just fine on
DragonFly.  I can't test mzip.c because I don't see the compression
library you are calling (maybe that's a FreeBSD thing).  The results
of the grep test ought to be similar for FreeBSD since the heuristic
used by both OS's is the same.  If they aren't, something might have
gotten nerfed accidentally in the FreeBSD tree.

Here is the cache case test.  mmap is clearly faster (though I would
again caution that this should not be an implicit assumption since
VM fault overheads can rival read() overheads, depending on the
situation).

The 'x1' file in all tests below is simply /usr/share/dict/words
concatenated over and over again to produce a large file.

crater# ls -la x1
-rw-r--r--  1 root  wheel  638228992 Mar 23 11:36 x1
[ machine has 1GB of ram ]

crater# time grep --mmap asdfasf x1
1.000u 0.117s 0:01.11 100.0%10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.976u 0.132s 0:01.13 97.3% 10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.984u 0.140s 0:01.11 100.9%10+41k 0+0io 0pf+0w

crater# time grep asdfasf x1
0.601u 0.781s 0:01.40 98.5% 10+42k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.507u 0.867s 0:01.39 97.8% 10+40k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.562u 0.812s 0:01.43 95.8% 10+41k 0+0io 0pf+0w

crater# iostat 1
[ while grep is running, in order to test the cache case and verify that
  no I/O is occurring once the data has been cached ]


The disk I/O case, which I can test by unmounting and remounting the
partition containing the file in question, then running grep, seems
to be well optimized on DragonFly.  It should be similarly optimized
on FreeBSD since the code that does this optimization is nearly the
same.  In my test, it is clear that the page-fault overhead in the
uncached case is considerably greater then the copying overhead of
a read(), though not by much.  And I would expect that, too.

test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.382u 0.351s 0:10.23 7.1%  55+141k 42+0io 4pf+0w
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.390u 0.367s 0:10.16 7.3%  48+123k 42+0io 0pf+0w

test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.539u 0.265s 0:10.53 7.5%  36+93k 42+0io 19518pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.617u 0.289s 0:10.47 8.5%  41+105k 42+0io 19518pf+0w
test28# 

test28# iostat 1 during the test showed ~60MBytes/sec for all four tests

Perhaps you should post specifics of the test you are running, as well
as specifics of the results you are getting, such as the actual timing
output instead of a human interpretation of the results.  For that
matter, being an opteron system, were you running the tests on a UP
system or an SMP system?  grep is single-threaded, so on a 2-cpu
system it will show 50% cpu utilization since one cpu will be 
saturated and the other idle.  With specifics, a FreeBSD person can
try to reproduce your test results.

A grep vs grep --mmap test is pretty straightforward and should be
a good test of the VM read-ahead code, but there might always be some
unknown circumstance specific to a machine configuration that is
the cause of the problem.  Repeatability and reproducability by
third parties is important when diagnosing any problem.

Insofar as MADV_SEQUENTIAL goes... you shouldn't need it on FreeBSD.
Unless someone ripped it out since I committed it many years ago, which
I doubt, FreeBSD's VM heuristic will figure out that the accesses
are sequential and start issuing read-aheads.  It should pre-fault, and
it should do read-ahead.  That isn't to say that there isn't a bug, just
that everyone interested in the problem has to be able to reproduce it
and help each other track down the source.  Just making 

Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:Yes, they both do work fine, but time gives very different stats for each. In 
:my experiments, the total CPU time is noticeably less with mmap, but the 
:elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 -- 
:notice the large number of page faults, because the system does not try to 
:preload file in the mmap case as it does in the read case:
:
:   time fgrep meowmeowmeow /home/oh.0.dump
:   2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
:   time fgrep --mmap  meowmeowmeow /home/oh.0.dump
:   1.552u 7.109s 2:46.03 5.2%  18+1031k 156+0io 106327pf+0w
:
:Use a big enough file to bust the memory caching (oh.0.dump above is 2.9Gb), 
:I'm sure, you will have no problems reproducing this result.

106,000 page faults.  How many pages is a 2.9GB file?  If this is running
in 64-bit mode those would be 8K pages, right?  So that would come to 
around 380,000 pages.  About 1:4.  So, clearly the operating system 
*IS* pre-faulting multiple pages.  

Since I don't believe that a memory fault would be so inefficient as
to account for 80 seconds of run time, it seems more likely to me that
the problem is that the VM system is not issuing read-aheads.  Not
issuing read-aheads would easily account for the 80 seconds.

It is possible that the kernel believes the VM system to be too loaded
to issue read-aheads, as a consequence of your blowing out of the system
caches.  It is also possible that the read-ahead code is broken in
FreeBSD.  To determine which of the two is more likely, you have to
run a smaller data set (like 600MB of data on a system with 1GB of ram),
and use the unmount/mount trick to clear the cache before each grep test.

If the time differential is still huge using the unmount/mount data set
test as described above, then the VM system's read-ahead code is broken.
If the time differential is tiny, however, then it's probably nothing
more than the kernel interpreting your massive 2.9GB mmap as being
too stressful on the VM system and disabling read-aheads for that
reason.

In any case, this sort of test is not really a good poster child for how
to use mmap().  Nobody in their right mind uses mmap() on datasets that
they expect to be uncacheable and which are accessed sequentially.  It's
just plain silly to use mmap() in that sort of circumstance.  This is
a truism on ANY operating system, not just FreeBSD.  The uncached
data set test (using unmount/mount and a dataset which fits into memory)
is a far more realistic test because it simulates the most common case
encountered by a system under load... the accessing of a reasonably sized
data set which happens to not be in the cache.

-Matt



Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:I thought one serious advantage to this situation for sequential read
:mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
:wait for the clock hands to reap them.  On a large Solaris box I used
:to have the non-pleasure of running the VM page scan rate was high, and
:I suggested to the app vendor that proper use of mmap might reduce that
:overhead.  Admittedly the files in question were much smaller than the
:available memory, but they were also not likely to be referenced again
:before the memory had to be reclaimed forcibly by the VM system.
:
:Is that not the case?  Is it better to let the VM system reclaim pages
:as needed?
:
:Thanks,
:
:Gary

madvise() should theoretically have that effect, but it isn't quite
so simple a solution.

Let's say you have, oh, your workstation, with 1GB of ram, and you
run a program which runs several passes on a 900MB data set.
Your X session, xterms, gnome, kde, etc etc etc all take around 300MB
of working memory.

Now that data set could fit into memory if portions of your UI were
pushed out of memory.  The question is not only how much of that data
set should the kernel fit into memory, but which portions of that data
set should the kernel fit into memory and whether the kernel should
bump out other data (pieces of your UI) to make it fit.

Scenario #1:  If the kernel fits the whole 900MB data set into memory,
the entire rest of the system would have to compete for the remaining
100MB of memory.  Your UI would suck rocks.

Scenario #2: If the kernel fits 700MB of the data set into memory, and
the rest of the system (your UI, etc) is only using 300MB, and the kernel
is using MADV_DONTNEED on pages it has already scanned, now your UI
works fine but your data set processing program is continuously 
accessing the disk for all 900MB of data, on every pass, because the
kernel is always only keeping the most recently accessed 700MB of
the 900MB data set in memory.

Scenario #3: Now let's say the kernel decides to keep just the first
700MB of the data set in memory, and not try to cache the last 200MB
of the data set.  Now your UI works fine, and your processing program
runs FOUR TIMES FASTER because it only has to access the disk for
the last 200MB of the 900MB data set.

--

Now, which of these scenarios does madvise() cover?  Does it cover
scenario #1?  Well, no.  The madvise() call that the program makes has
no clue whether you intend to play around with your UI every few minutes,
or whether you intend to leave the room for 40 minutes.  If the kernel
guesses wrong, we wind up with one unhappy user.  

What about scenario #2?  There the program decided to call madvise(),
and the system dutifully reuses the pages, and you come back an hour
later and your data processing program has only done 10 passes out
of the 50 passes it needs to do on the data and you are PISSED.

Ok.  What about scenario #3?  Oops.  The program has no way of knowing
how much memory you need for your UI to be 'happy'.  No madvise() call
of any sort will make you happy.  Not only that, but the KERNEL has no
way of knowing that your data processing program intends to make
multiple passes on the data set, whether the working set is represented
by one file or several files, and even the data processing program itself
might not know (you might be running a script which runs a separate
program for each pass on the same data set).

So much for madvise().

So, no matter what, there will ALWAYS be an unhappy user somewhere.  Let's
take Mikhail's grep test as an example.  If he runs it over and over
again, should the kernel be 'optimized' to realize that the same data
set is being scanned sequentially, over and over again, ignore the
localized sequential nature of the data accesses, and just keep a
dedicated portion of that data set in memory to reduce long term
disk access?  Should it keep the first 1.5GB, or the last 1.5GB,
or perhaps it should slice the data set up and keep every other 256MB
block?  How does it figure out what to cache and when?  What if the
program suddenly starts accessing the data in a cacheable way?

Maybe it should randomly throw some of the data away slowly in the hopes
of 'adapting' to the access pattern, which would also require that it
throw away most of the 'recently read' data far more quickly to make
up for the data it isn't throwing away.  Believe it or not, that
actually works for certain types of problems, except then you get hung
up in a situation where two subsystems are competing with each other
for memory resources (like mail server verses web server), and the
system is unable to cope as the relative load factors for the competing
subsystems change.  The problem becomes really complex really fast.

This 

Re: more weird bugs with mmap-ing via NFS

2006-03-22 Thread Matthew Dillon
My guess is that you are exporting the filesystem as a particular
user id that is not root (i.e. you do not have -maproot=root: in the
exports line on the server).

What is likely happening is that the NFS client is trying to push out
the pages using the root uid rather then the user uid.  This is a highly
probable circumstance for VM pages because once they get disassociated
from the related buffer cache buffer, the cred information for the
last process to modify the related VM pages is lost.  When the kernel
tries to flush the pages out it winds up using root creds.

On DragonFly, I gave up entirely on trying to associate creds with
buffers.

I consider this more of a bug on the server side than on the client
side.  The server should automatically translate the root uid to the
exported uid for I/O ops.  Or, barring that, we have to add an option
to the client-side mount to be able to specify a user/group id to 
translate all I/O requests to.

-Matt



Re: more weird bugs with mmap-ing via NFS

2006-03-22 Thread Matthew Dillon
:So mmap is just a more reliable way to trigger this problem, right?
:
:Is not this, like, a major bug? A file can be opened, written to for a while, 
:and then -- at a semi-random moment -- the log will drop across the road? 
:Ouch...
:
:Thanks a lot to all concerned for helping solve this problem. Yours,
:
:   -mi

I consider it a bug.  I think the only way to reliably fix the problem
is to give the client the ability to specify the uid to issue RPCs with
in the NFS mount command, to match what the export does.

-Matt
Matthew Dillon 


Re: flushing anonymous buffers over NFS is rejected by server (more weird bugs with mmap-ing via NFS)

2006-03-22 Thread Matthew Dillon

:So, the problem is, the dirtied buffers _sometimes_ lose their owner and thus 
:become root-owned. When the NFS client tries to flush them out, the NFS 
:server (by default suspecting remote roots of being evil) rejects the 
:flushing, which brings the client to its weak knees.
:
:1. Do the yet unflushed buffers really have to be anonymous?
:
:2. Can't the client's knees be strengthened in this regard?
:
:Thanks!
:
:   -mi

Basically correct, though it's not the buffers that get lost, it's that
the VM pages get disconnected from the buffers when the buffers are
recycled, then get reconnected (sans creds info) later on.

The basic answer is that we don't want to strengthen the client
with regards to buffer/VM page creds, because buffers and VM pages
are cached items in the system and can potentially have many 
different 'owners'.  The entire cred infrastructure for buffers
was a terrible hack put into place many years ago, solely to support NFS.
It created a huge mess in the system code and didn't even solve
the problem (as you found out).  I've already removed most of that junk
from DragonFly and I would argue that there isn't much point keeping it
in FreeBSD either.

The only real solution is to make the NFS client aware of the 
restricted user id exported by the server by requiring that the
same uid be specified in the mount command the client uses to
mount the NFS partition.  The NFS client would then use that user id
for all write I/O operations.

-Matt
Matthew Dillon 


Re: flushing anonymous buffers over NFS is rejected by server (more weird bugs with mmap-ing via NFS)

2006-03-22 Thread Matthew Dillon
:What about different users accessing the same share from the same client?
:
:   -mi

   Yah, you're right.  That wouldn't work.  It would have to be a server-side
   solution.  Basically the server would have to accept root creds but 
   instead of translating them to a fixed uid it should allow the
   I/O operation to run as long as some non-root user would be able to
   do the I/O op.

-Matt
Matthew Dillon 


Re: weird bugs with mmap-ing via NFS

2006-03-21 Thread Matthew Dillon

:
:   [Moved from -current to -stable]
:
:On Tuesday, 21 March 2006 16:23, Matthew Dillon wrote:
:     You might be doing just writes to the mmap()'d memory, but the system
:     doesn't know that.
:
:Actually, it does. The program tells it, that I don't care to read, what's 
:currently there, by specifying the PROT_WRITE flag only.

That's an architectural flag.  Very few architectures actually support
write-only memory maps.  IA32 does not.  It does not change the
fact that the operating system must validate the memory underlying
the page, nor does it imply that the system shouldn't.

:Sounds like a missed optimization opportunity :-(

Even on architectures that did support write-only memory maps, the
system would still have to fault in the rest of the data on the page,
because the system would have no way of knowing which bytes in the 
page you wrote to (that is, whether you wrote to all the bytes in the
page or whether you left gaps).  The system does not take a fault for
every write you issue to the page, only for the first one.  So, no 
matter how you twist it, the system *MUST* validate the entire page
when it takes the page fault.

:     It kinda sounds like the buffer cache is getting blown out, but not
:     having seen the program I can't really analyze it.
:
:See http://aldan.algebra.com/~mi/mzip.c

I can't access this URL, it says 'not found'.

:     It will always be more efficient to write to a file using write() than
:     using mmap()
:
:I understand, that write() is much better optimized at the moment, but the 
:mmap interface carries some advantages, which may allow future OSes to 
:optimize their ways. The application can hint at its planned usage of the 
:data via madvise, for example.

Yes, but those advantages are limited by the way memory mapping hardware
works.  There are some things that simply cannot be optimized through
lack of sufficient information.

Reading via mmap() is very well optimized.  Making modifications via
mmap() is optimized insofar as the expectation that the data is intended
to be read, modified, and written back.  It is not possible to
optimize with the expectation that the data would only be written to
the mmap, for the reasons described above.  The hardware simply does not
provide sufficient information to the operating system to optimize 
the write-only case.

:Unfortunately, my problem, so far, is with it not writing _at all_...

Not sure what is going on since I can't access the program yet, but
I'd be happy to take a look at the code.

The most common mistake people make when trying to write to a file via
mmap() is that they forget to ftruncate() the file to the proper length
first.  Mapped memory beyond the file's EOF is ignored within the last
page, and the program will take a page fault if it tries to write to
mapped pages that are entirely beyond the file's current EOF.  Writing
to mapped memory does *not* extend the size of a file.  Only 
ftruncate() or write() can extend the size of a file.

The second most common mistake is to forget to specify MAP_SHARED
in the mmap() call.
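
As a concrete illustration, here is a minimal sketch of the pattern
being described (my own illustrative code, not code from this thread;
error handling abbreviated):

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>

	int
	write_via_mmap(const char *path, const char *buf, size_t len)
	{
		void *p;
		int fd;

		if ((fd = open(path, O_RDWR | O_CREAT, 0644)) < 0)
			return (-1);
		if (ftruncate(fd, len) < 0)	/* extend the file FIRST */
			return (-1);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);		/* MAP_SHARED, not MAP_PRIVATE */
		if (p == MAP_FAILED)
			return (-1);
		memcpy(p, buf, len);		/* store through the mapping */
		munmap(p, len);			/* munmap()/msync() pushes it out */
		close(fd);
		return (0);
	}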

:Yes, this is an example of how a good implemented mmap can be better than 
:write. Without explicit writes by the application and without doubling the 
:memory requirements, the data can be written in the most optimal way.
:...
:Thanks for your help. Yours,
:
:   -mi

I don't think mmap()-based writing will EVER be more efficient than
write() except in the case where the entire data set fits into memory
and has been entirely cached by the system.  In that one case writing via
mmap will be faster.  In all other cases the system will be taking as
many VM faults on the pages as it would be taking system call faults
to execute the write()'s.

You are making a classic mistake by assuming that the copying overhead
of a write() into the file's backing store, versus directly mmap()ing
the file's backing store, represents a large chunk of the overhead for
the operation.  In fact, the copying overhead represents only a small
chunk of the related overhead.  The vast majority of the overhead is
always going to be the disk I/O itself.

I/O must occur even in the cached/delayed-write case so on a busy system
it still represents the greatest overhead from the point of view of
system load.  On a lightly loaded system nobody is going to care about
a few milliseconds of improved performance here and there since, by 
definition, the system is lightly loaded and thus has plenty of idle
cpu and I/O cycles to spare.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]

Re: more weird bugs with mmap-ing via NFS

2006-03-21 Thread Matthew Dillon

:When the client is in this state it remains quite usable except for the 
:following:
:
:   1) Trying to start `systat 1 -vm' stalls ALL access to local disks,
:  apparently -- no new programs can start, and the running ones
:  can not access any data either; attempts to Ctrl-C the starting
:  systat succeed only after several minutes.
:
:   2) The writing process is stuck unkillable in the following state:
:
:   CPU PRI NI   VSZ   RSS MWCHAN STAT  TT   TIME
:   27  -4  0 1351368 137764 nfs    DL    p4   1:05,52
:
:  Sending it any signal has no effect. (Large sizes are explained
:  by it mmap-ing its large input and output.)
:
:   3) Forceful umount of the share, that the program is writing to,
:  paralyzes the system for several minutes -- unlike in 1), not
:  even the mouse is moving. It would seem, the process is dumping
:  core, but it is not -- when the system unfreezes, the only
:  message from the kernel is:
:
:   vm_fault: pager read error, pid  (mzip)
: 
:Again, this is on 6.1/i386 from today, which we are about to release into the 
:cruel world.
:
:Yours,
:
:   -mi

There are a number of problems using a block size of 65536.  First of
all, I think you can only safely do it if you use a TCP mount, also
assuming the TCP buffer size is appropriately large to hold an entire
packet.  For UDP mounts, 65536 is too large (the maximum UDP datagram
length is 65535 bytes; for that matter, the *IP* packet itself can
not exceed 65535 bytes).  So 65536 will not work with a UDP mount.

The second problem is related to the network driver.  The packet MTU
is 1500, which means, typically, a limit of around 1460-1480 payload
bytes per packet.  A large UDP packet of, say, 48KB will be broken
down into over 33 IP fragments.  The network stack could very well
drop some of these packet fragments, making delivery of the overall
UDP packet unreliable.
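
(Worked out: 48KB is 49152 bytes of payload, and at ~1480 bytes per
fragment that is 49152/1480 =~ 34 fragments; losing any single
fragment loses the entire datagram.)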

The NFS protocol itself does allow read and write packets to be
truncated providing that the read or write operation is either bounded
by the file EOF or (for a read) the remaining data is all zeros.
Typically the all-zeros case is only optimized by the NFS server when
the underlying filesystem block itself is unallocated (i.e. a 'hole'
in the file).  In all other cases the full NFS block size is passed
between client and server.

I would stick to an NFS block size of 8K or 16K.  Frankly, there is
no real reason to use a larger block size.
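
For example, something along these lines (mount_nfs(8) flags; the
server and paths are made up):

	mount_nfs -T -r 16384 -w 16384 server:/export /mnt

-T selects a TCP transport and -r/-w set the read and write block
sizes to 16K.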

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: more weird bugs with mmap-ing via NFS

2006-03-21 Thread Matthew Dillon

:I don't specify either, but the default is UDP, is not it?

Yes, the default is UDP.

: Now imagine a client that experiences this problem only
: sometimes. Modern hardware, but for some reason (network
: congestion?) some frames are still lost if sent back-to-back.
: (Realtek chipset on the receiving side?)
:
:No, both sides have em-cards and are only separated by a rather decent large 
:switch.
:
:I'll try the TCP mount, workaround. If it helps, we can assume, our UDP NFS is 
:broken for sustained high bandwidth writes :-(
:
:Thanks!
:
:   -mi

I can't speak for FreeBSD's current implementation, but it should be
possible to determine whether there is an issue with packet drops or
not by observing the network statistics via netstat -s.  Generally
speaking, however, I know of no problems with a UDP NFS mount per se,
at least as long as reasonable values are chosen for the block size.

The mmap() call in your mzip.c program looks ok to me with the exception
of the use of PROT_WRITE.  Try using PROT_READ|PROT_WRITE.  The
ftruncate() looks ok as well.   If the program works over a local
filesystem but fails to produce data in the output file on an NFS
mount (but completes otherwise), then there is a bug in NFS somewhere.
If the problem is simply due to the program stalling, and not completing
due to the stalling, then it could be a problem with dropped packets
in the network stack.  If the problem is that the program simply runs
very inefficiently over NFS, with excessive network bandwidth for the
data being written (as you also reported), this is probably an artifact
of attempting to use mmap() to write out the data, for reasons previously
discussed.

I would again caution against using mmap() to populate a file in this
manner.  Even with MADV_SEQUENTIAL there is no guarantee that the system
will actually flush the pages to the actual file on the server
sequentially, and you could end up with a very badly fragmented file.
When a file is truncated to a larger size the underlying filesystem
does not allocate the actual backing store on disk for the data hole
created.  Allocation winds up being based on the order in which the
operating system flushes the VM pages.  The VM system does its best, but
it is really designed more as a random-access system rather than a
sequential system.  Pages are flushed based on memory availability and
a thousand other factors and may not necessarily be flushed to the file
in the order you think they should be.  write() is really a much better
way to write out a sequential file (on any operating system, not
just BSD).

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: more weird bugs with mmap-ing via NFS

2006-03-21 Thread Matthew Dillon
:The file stops growing, but the network bandwidth remains at 20Mb/s. `Netstat 
:-s' on the client, had the following to say (udp and ip only):

If the network bandwidth is still going full bore then the program is
doing something.  NFS retries would not account for it.  A simple
test for that would be to ^Z the program once it gets into this state
and see if the network bandwidth goes to zero.

So if we assume that packets aren't being lost, then the question 
becomes: what is the program doing that is causing the network
bandwidth to go nuts?  And if it isn't the program, then what is the
OS doing that is causing the network bandwidth to go nuts?  

ktrace on the program would tell us if read() or write() or ftruncate()
were causing an issue.

'vmstat 1' while the program is running would tell us if VM faults
are creating an issue.

If neither of those is an issue then I would guess that the problem
could be related to the NFSv3 2-phase commit protocol.  A way to test
that would be to mount with NFSv2 and see if the problem still occurs.
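
(e.g. something like "mount_nfs -2 server:/export /mnt", where -2
forces the version 2 protocol; the server and paths are made up.)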

Running tcpdump on the network interface while the program is in this 
state might also give us some valuable clues.  50 lines of output from
something like this after the program has gotten into its weird state
might give us a clue:

tcpdump -s 4096 -n -i interface -l port 2049

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Re[4]: serious networking (em) performance (ggate and NFS) problem

2004-11-22 Thread Matthew Dillon
:Increasing the interrupt moderation frequency worked on the re driver,
:but it only made it marginally better.  Even without moderation,
:however, I could lose packets without m_defrag.  I suspect that there is
:something in the higher level layers that is causing the packet loss.  I
:have no explanation why m_defrag makes such a big difference for me, but
:it does.  I also have no idea why a 20Mbps UDP stream can lose data over
:gigE phy and not lose anything over 100BT... without the above mentioned
:changes that is.

It kinda sounds like the receiver's UDP buffer is not large enough to
handle the burst traffic.  100BT is a much slower transport and the
receiver (userland process) was likely able to drain its buffer before
new packets arrived.

Use netstat -s to observe the drop statistics for udp on both the
sender and receiver sides.  You may also be able to get some useful
information looking at the ip stats on both sides too.

Try bumping up net.inet.udp.recvspace and see if that helps.
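
For example (the value here is purely illustrative):

	sysctl -w net.inet.udp.recvspace=262144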

In any case, you should be able to figure out where the drops are occurring
by observing netstat -s output.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Re[2]: serious networking (em) performance (ggate and NFS) problem

2004-11-21 Thread Matthew Dillon
: I did a simple benchmark at some settings.
:
: I used two boxes which are single Xeon 2.4GHz with on-boarded em.
: I measured a TCP throughput by iperf.
:
: These results show that the throughput of TCP increased if Interrupt
:Moderation is turned OFF. At least, adjusting these parameters affected
:TCP performance. Other appropriate combinations of parameters may exist.

Very interesting, but the only reason you get lower results is simply
because the TCP window is not big enough.  That's it.

8000 ints/sec = ~15KB of backlogged traffic.  x 2 (sender, receiver)

Multiply by two (both the sender's reception of acks and the receiver's
reception of data) and you get ~30KB.  This is awfully close to the
default 32.5KB window size that iperf uses.
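
(For reference: GigE moves ~125 MBytes/sec, so at 8000 ints/sec about
125000000/8000 =~ 15.6KB accumulates between interrupts in each
direction; doubling that for data plus acks gives the ~30KB figure.)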

Other than window sizing issues I can think of no rational reason why
throughput would be lower.  Can you?  And, in fact, when I do the same
tests on DragonFly and play with the interrupt throttle rate I get
nearly the results I expect.

* Shuttle Athlon 64 3200+ box, EM card in 32 bit PCI slot 
* 2 machines connected through a GiGE switch
* All other hw.em0 delays set to 0 on both sides
* throttle settings set on both sides
* -w option set on iperf client AND server for 63.5KB window
* software interrupt throttling has been turned off for these tests

throttle        result          result
freq            (32.5KB win)    (63.5KB win)
                (default)
--------        ------------    ------------

maxrate         481 MBit/s      533 MBit/s  (not sure what's going on here)
120000          518 MBit/s      558 MBit/s  (not sure what's going on here)
100000          613 MBit/s      667 MBit/s  (not sure what's going on here)
 70000          679 MBit/s      691 MBit/s
 60000          668 MBit/s      694 MBit/s
 50000          678 MBit/s      684 MBit/s
 40000          694 MBit/s      696 MBit/s
 30000          694 MBit/s      696 MBit/s
 20000          698 MBit/s      703 MBit/s
 10000          707 MBit/s      716 MBit/s
  9000          708 MBit/s      716 MBit/s
  8000          710 MBit/s      717 MBit/s  <--- drop off pt 32.5KB win
  7000          683 MBit/s      716 MBit/s
  6000          680 MBit/s      720 MBit/s
  5000          652 MBit/s      718 MBit/s  <--- drop off pt 63.5KB win
  4000          555 MBit/s      695 MBit/s
  3000          522 MBit/s      533 MBit/s  <--- GiGE throttling likely
  2000          449 MBit/s      384 MBit/s  (256 ring descriptors =
  1000          260 MBit/s      193 MBit/s   2500 hz minimum)

Unless you are in a situation where you need to route small packets
flying around a cluster where low latency is important, it doesn't really
make any sense to turn off interrupt throttling.  It might make sense 
to change the default from 8000 to 10000 to handle typical default
TCP window sizes (at least in a LAN situation), but it certainly should
not be turned off.

I got some weird results when I increased the frequency past 100KHz, and
when I turned throttling off entirely.  I'm not sure why.  Maybe setting
the ITR register to 0 is a bad idea.  If I set it to 1 (i.e. 3906250 Hz)
then I get 625 MBit/s.  Setting the ITR to 1 (i.e. 256ns delay) should
amount to the same thing as setting it to 0 but it doesn't.  Very odd.
The maximum interrupt rate as reported by systat is only ~46000 ints/sec
so all the values above 50KHz should read about the same... and they
do until we hit around 100Khz (10uS delay).  Then everything goes to
hell in a handbasket.

Conclusion: 10000 hz would probably be a better default than 8000 hz.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Re[4]: serious networking (em) performance (ggate and NFS) problem

2004-11-21 Thread Matthew Dillon

In fact, even if you are just routing packets I would argue that turning
off moderation might not be a good choice... it might make more sense
to set it to some high frequency like 40000 Hz.  But, of course, it
depends on what other things the machine might be running and what sort
of processing (e.g. firewall lists) the machine has to do on the packets.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: serious networking (em) performance (ggate and NFS) problem

2004-11-18 Thread Matthew Dillon
Polling should not produce any improvement over interrupts for EM0.
The EM0 card will aggregate 8-14+ packets per interrupt, or more,
which works out to only around 8000 interrupts/sec.  I've got a ton of these
cards installed.

# mount_nfs -a 4 dhcp61:/home /mnt
# dd if=/mnt/x of=/dev/null bs=32k
# netstat -in 1
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes  colls
 66401 0   93668746   5534 0 962920 0
 66426 0   94230092   5537 01007108 0
 66424 0   93699848   5536 0 963268 0
 66422 0   94222372   5536 01007290 0
 66391 0   93654846   5534 0 962746 0
 66375 0   94154432   5532 01006404 0

  zfod   Interrupts
Proc:r  p  d  s  wCsw  Trp  Sys  Int  Sof  Fltcow8100 total
 19  62117   75 81004   12  88864 wire   7873 mux irq10
10404 act ata0 irq14
19.2%Sys   0.0%Intr  0.0%User  0.0%Nice 80.8%Idl   864476 inact   ata1 irq15
||||||||||  58152 cache   mux irq11
==   2992 free227 clk irq0


Note that the interrupt rate is only 7873 interrupts per second
while I am transfering 94 MBytes/sec over NFS (UDP) and receiving
over 66000 packets per second (~8 packets per interrupt).

If I use a TCP mount I get just about the same thing:

# mount_nfs -T -a 4 dhcp61:/home /mnt
# dd if=/mnt/x of=/dev/null bs=32k
# netstat -in 1

            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes  colls
 61752 0   93978800   8091 0 968618 0
 61780 0   93530484   8098 0 904370 0
 61710 0   93917880   8093 0 968128 0
 61754 0   93491260   8095 0 903940 0
 61756 0   93986320   8097 0 968336 0


Proc:r  p  d  s  wCsw  Trp  Sys  Int  Sof  Fltcow8145 total
   5  8 22828   13 5490 8146   13   11 141556 wire   7917 mux irq10
 7800 act ata0 irq14
26.4%Sys   0.0%Intr  0.0%User  0.0%Nice 73.6%Idl   244872 inact   ata1 irq15
||||||||||  8 cache   mux irq11
=  630780 free228 clk irq0

In this case around 8000 interrupts per second with 61700 packets per
second incoming on the interface (around ~8 packets per interrupt).
The extra interrupts are due to the additional outgoing TCP ack traffic.

If I look at the systat -vm 1 output on the NFS server it also sees
only around 8000 interrupts per second, which isn't saying much other
than that its transmit path (61700 pps outgoing) is not creating an undue
interrupt burden relative to the receive path.

-Matt

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


USENIX2004 photos online

2004-07-04 Thread Matthew Dillon
I took a bunch of photos at USENIX, mainly of BSD related activities.
The photos are now online at:

http://apollo.backplane.com/USENIX2004/

-Matt

Matthew Dillon
[EMAIL PROTECTED]
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


USB bug fix for DETACH message.

2004-03-18 Thread Matthew Dillon
A DragonFly user noticed that usbd does not seem to get DETACH events
for UMASS devices. 

I tracked this down (in the FreeBSD-5 codebase for your convenience)
to line 1382 of usb_subr.c:

/*usbd_add_dev_event(USB_EVENT_DEVICE_DETACH, dev);*/

This line was apparently commented out by wpaul in rev 1.22, in
January 2000.

NetBSD has this line uncommented... that is, activated, and they
recently committed a bug fix (1.110 I believe in the NetBSD source
tree) that solves the problem I'm sure wpaul encountered that caused
him to comment the line out.  The bug fix is trivial.  Just above this
code, around line 1378, you simply need to NULL out dev->subdevs[i]
after detaching it:

printf(" port %d", up->portno);
printf(" (addr %d) disconnected\n", dev->address);
config_detach(dev->subdevs[i], DETACH_FORCE);
dev->subdevs[i] = NULL;		/* <--- ADDME */

If you want DETACH events to work, uncomment the add_dev_event and
make the bug fix above and DETACH events will work again.  If you are
going to do this, please do this in both FreeBSD-5.x and FreeBSD-4.x.

You may also wish to commit NetBSD's 1.111 of usb_subr.c, which reorders
an address assignment to work around certain non-compliant USB devices
and allow them to work.  The NetBSD repository can be accessed via 
CVS using the server string ':pserver:[EMAIL PROTECTED]:/cvsroot'
(don't accidentally overwrite your FBsd tree, though! :-)).

-Matt

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: umass panic (after detaching/attaching card-reader 3 times)

2004-03-17 Thread Matthew Dillon
I just fixed a bug in DragonFly's UMASS/CAM interface.  DragonFly is
basically using the 4.x CAM code and the 5.x USB code.  The latest
CD ISO image does not yet reflect the change or I'd just say try burning
it and booting and see if you can get the system to screw up (I'll
generate a new ISO image tomorrow), but perhaps what I found can
serve as a hint to people working on FreeBSD.

In any case, the bug had to do with the way UMASS detaches the CAM SIM.
What happens is that umass.c/USB_DETACH calls umass_cam_detach_sim(sc)
and then free's the softc.

The problem is that CAM may still have a bus scan timeout in progress
and a bus scan in the device queue (the device queue is destroyed by
umass_cam_detach_sim), and I believe it is also possible for UMASS to
have an operation in progress (initiated by CAM) which is racing
the detach operation.

When UMASS calls umass_cam_detach_sim() the CAM SIM gets ripped out 
from under the CAM bus structure but the queued timeout still needs
to indirect through the SIM so when the timeout happens, BOOM.  The
bug can also lead to lockups during boot... CAM installs an interrupt
completion item that the boot code waits for which scans all the CAM
busses, but if UMASS detaches the sim with ops still queued the bus
scan never completes and the system boot basically locks up forever
waiting for it to complete.  I was also able to easily lockup the
USB chipsets hard while diagnosing these bugs, to the point where I had
to physically unplug the machine to get it to work again.  I believe 
this was due to UMASS not properly aborting the pipes (leading to
a violation of pipe command serialization when talking to the USB
hardware).

The fixes I made to DragonFly were to ref-count the CAM SIM so it would
not be ripped out from under the CAM bus structure, to include CAM's
pending timeout in the ref-count of the CAM device structure so IT
wouldn't get ripped out if an active timeout exists, to abort all UMASS
pipes prior to detaching the sim, and to augment the CAM XPT code's
AC_LOST_DEVICE path to: (1) clear out any pending timeouts and 
(2) flush all CAM software interrupts to make sure the async events 
have actually completed.

I don't know how much of this applies to FreeBSD-5, since FreeBSD-5
seems to have rewritten a large chunk of CAM, but it does look like
some of it might apply.  Probably all of these issues apply to FreeBSD-4.

The patches are in the DragonFly CVS repository, related to the following
directories (in DFly): /usr/src/sys/bus/cam, /usr/src/sys/bus/usb,
and /usr/src/sys/dev/usbmisc, if I remember correctly.  
www.dragonflybsd.org.  Perhaps Julian, who has been working on the USB
code in 4.x can take it from there.  Information is the best I can offer.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

:Jan Pechanec ([EMAIL PROTECTED]) wrote: 
:
:On Sat, 6 Mar 2004, Holger Kipp wrote:
:
:I experience a very repeatable but unwanted behaviour with umass/usb:
:
:System hangs/panics after detaching and attaching 8-in-1 Card Reader
:several times. Card Reader is attached to Cypress Semiconductor Slim
:Hub (ie not directly), but using the built-in hub give the same
:results.
:
:we have similar experience with some of our new boxes based on
:Via chipset. It seems to me that the attached device is innocent (same
:errors as yours - uhub port errors, umass detached, umass BBB reset
:failed etc.) and that the problem is somewhere on Via's side. We are
:still analysing the problem - but did you get any further since then?
:
:Unfortunately not - I was still 'waiting' for someone who deals with
:usb/umass to shed some light on the issue or asking the right questions.
:
:(What is worse is that I don't have the time to look into this right now.)
:
:The interesting thing is that umass detach seems to happen during
:the BBB-whatever-cycle such that the system seems to end up using invalid
:nullpointers. This imho should never happen.
:
:Mar 6 20:43:41 katrin /kernel: umass0: BBB bulk-in clear stall failed, STALLED
:Mar 6 20:43:41 katrin /kernel: umass0: at uhub2 port 4 (addr 3) disconnected
:Mar 6 20:43:41 katrin /kernel: umass0: detached
:Mar 6 20:43:41 katrin /kernel: (null): BBB bulk-out clear stall failed, CANCELLED
:Mar 6 20:43:41 katrin /kernel: umass-sim:0:0:0:func_code 0x0901: Invalid target
:(target needed)
:Mar 6 20:43:41 katrin last message repeated 2 times
:Mar 6 20:43:41 katrin /kernel: panic: (null): Unknown state 0
:
:Unfortunately I don't have enough resources to test this with CURRENT. I am also
:waiting for MFC of umass which might fix a few things.
:
:Regards,
:Holger Kipp
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org

Re: umass panic (after detaching/attaching card-reader 3 times)

2004-03-17 Thread Matthew Dillon
:
:I would love to, if I could find hardware that reproduces the problem.  I
:went shopping for USB thumb drives a while back and only came up with
:working ones.
:
:I have a Soyo KT400 Dragon Lite machine at home.
:
:-- 
:Doug White|  FreeBSD: The Power to Serve
:[EMAIL PROTECTED]  |  www.FreeBSD.org

It took me a while to reproduce the panic from the first bug report
that was posted to our list, but I was finally able to do it by 
plugging in three or four USB mass storage devices all at once,
in various combinations before and after booting.  The key in reproducing
the bug is to get the system to detach a UMASS during the boot sequence,
that's what causes the (now disconnected) CAM bus rescan timeout to blow
up.  For some unknown reason the UMASS/USB subsystem will often attach/
detach/reattach devices during boot, even though you haven't unplugged
them.  I don't know why... it still happens in DFly but it no longer
crashes.

I have brought the DFly ISO image up to date and added a README.USB
section and also an ehci.ko module (the rest of USB is already compiled
into GENERIC), which should be enough for anyone having this problem
to play around with the USB subsystem from a DragonFly ISO CD boot 
without having to touch their hard drive.  The ISO is available here
(around 70MB gzip'd, 200MB unzipped).  It's a fully live boot CD, just
login as root.  You can even edit things in /etc, /tmp, /dev, etc...
they are MFS mounts, of course.

ftp://ftp.dragonflybsd.org/iso-images/dfly-20040317b.iso.gz

I am curious if the people reporting the problem on FBsd-4 still have
the problem when using the above DFly ISO.  If not then I would say that
pretty much guarantees that my recent bug fixes (described in my previous
posting to this list) have dealt with the issue.  If so then I have more
work to do and would appreciate a bug report / dmesg output / backtrace
from DDB if possible :-)

p.s. the ISO fits on those cute mini CD-R's.

-Matt

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Problems reclaiming VM cache = XFree86 startup annoyance

2003-12-20 Thread Matthew Dillon

:...
: (Paul Mather)
: As you can probably gather, all this manual intervention is a bit of a
: hassle.  So, my question is this: is there a way explicitly to force
: the kernel to flush its VM cache (to move it to Free).  Failing
: that, are there any sysctls to tune to help alleviate the problem?
: The only sysctls I change in /etc/sysctl.conf are as follows:
:
:(DG)
:   I don't know what is causing your problem, but 'cache' pages in FreeBSD
:are free pages - they can be allocated directly in the page allocation code.
:They only differ from free pages in that they contain cached file data.
:   So the number of pages 'cache' vs. 'free' isn't the cause of the problem.
:...

Other disk activity can interfere with swap performance.  If these other
background jobs Paul is running are doing a lot of disk I/O or using a 
lot of memory, regardless of their nice value, they could be causing a
significant reduction in swap I/O performance and/or be causing page
thrashing.  The 'swwrt' state is waiting for an I/O to complete, *NOT*
waiting for a memory allocation, which implies an I/O performance issue.

It is likely that nothing is frozen per se, just that I/O performance
is insufficient to handle the paging load caused by starting X *AND*
the I/O/memory load of whatever other background processes are running.
When you ^Z the background process, the I/O it is performing stops which
allows the paging I/O being caused by the X startup to get 100% of 
available disk bandwidth.  Also remember that in an I/O bound situation,
the 'nice' value of a process becomes irrelevant because there is plenty
of cpu available due to all the processes being predominantly in a
blocking state on I/O.

The easiest solution is to add more memory to the machine.  It is fairly
obvious to me that the machine does not have enough for the workload
being thrown at it.  Alternatively you may wish to examine the memory and
I/O footprint of these nice +20 processes and take steps to significantly
reduce both.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: POSIX_C_SOURCE

2003-08-30 Thread Matthew Dillon
This is precisely what I did a few days ago in DragonFly.  The warnings
were getting annoying.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

:
:On Sat, Aug 30, 2003 at 12:49:15PM -0400, Garrett Wollman wrote:
: In article [EMAIL PROTECTED] you write:
: 
: Any chance that someone will finally commit the fixes to prevent the
: POSIX_C_SOURCE warnings from showing up? I saw a number of posts on this
: topic, but it still seems like it's not officially committed
: 
: /usr/include/sys/cdefs.h:273: warning: `_POSIX_C_SOURCE' is not defined
: /usr/include/sys/cdefs.h:279: warning: `_POSIX_C_SOURCE' is not defined
: 
: The warnings are wrong,[1] so you should probably ask the GCC people
: about that.
:
:The warnings are not wrong (see below), but anyway, since -stable uses
:the ancient GCC 2.95.4, the GCC people are not likely to
:give a damn in either case.  They usually don't care about the older
:releases.
:
: 
: -GAWollman
: 
: [1] That is to say, any identifier used in a preprocessor expression
: (after macro expansion) is defined to have a value of zero, and GCC
: should not be complaining about this.
:
:The code is correct, which is why GCC only gives a warning and not an
:error.  Code looking like that are usually an indication of a
:programmer error though, so GCC is perfectly right in warning about
:it.  This is similar to the compiler warning about unused variables,
:which isn't a bug either but often indicates a programmer mistake.
:
:To make gcc shut up, you can apply the following patch to cdefs.h
:which makes the warnings go away, without changing the semantics of the
:include file in any way.
:
:
:Index: cdefs.h
:===
:RCS file: /ncvs/src/sys/sys/cdefs.h,v
:retrieving revision 1.28.2.8
:diff -u -r1.28.2.8 cdefs.h
:--- cdefs.h	18 Sep 2002 04:05:13 -0000	1.28.2.8
:+++ cdefs.h	29 Jan 2003 21:23:30 -0000
:@@ -269,6 +269,8 @@
:  * Our macros begin with two underscores to avoid namespace screwage.
:  */
: 
:+#ifdef _POSIX_C_SOURCE
:+
: /* Deal with IEEE Std. 1003.1-1990, in which _POSIX_C_SOURCE == 1. */
: #if _POSIX_C_SOURCE == 1
: #undef _POSIX_C_SOURCE	/* Probably illegal, but beyond caring now. */
:@@ -280,6 +282,8 @@
: #undef _POSIX_C_SOURCE
: #define   _POSIX_C_SOURCE 199209
: #endif
:+
:+#endif /* _POSIX_C_SOURCE */
: 
: /* Deal with various X/Open Portability Guides and Single UNIX Spec. */
: #ifdef _XOPEN_SOURCE
:
:
:
:
:-- 
:Insert your favourite quote here.
:Erik Trulsson
:[EMAIL PROTECTED]
:___
:[EMAIL PROTECTED] mailing list
:http://lists.freebsd.org/mailman/listinfo/freebsd-stable
:To unsubscribe, send any mail to [EMAIL PROTECTED]
:

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: system slowdown - vnode related

2003-05-27 Thread Matthew Dillon
I'm a little confused.  What state is the vnlru kernel thread in?  It
sounds like vnlru must be stuck.

Note that you can gdb the live kernel and get a stack backtrace of any
stuck process.

gdb -k /kernel.debug /dev/mem   (or whatever)
proc N  (e.g. vnlru's pid)
back

All the processes stuck in 'inode' are likely associated with the 
problem, but if that is what is causing vnlru to be stuck I would expect
vnlru itself to be stuck in 'inode'.

unionfs is probably responsible.  I would not be surprised at all if 
unionfs is causing a deadlock somewhere which is creating a chain of
processes stuck in 'inode' which is in turn causing vnlru to get stuck.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

:
:On Mon, 26 May 2003, Mike Harding wrote:
:
: Er - are any changes made to RELENG_4_8 that aren't made to RELENG_4?  I
: thought it was the other way around - that 4_8 only got _some_ of the
: changes to RELENG_4...
:
:Ack, my fault ... sorry, wasn't thinking :(  RELENG_4 is correct ... I
:should have confirmed my settings before blathering on ...
:
:One of the scripts I used extensively while debugging this ... a quite
:simple one .. was:
:
:#!/bin/tcsh
:while ( 1 )
:  echo `sysctl debug.numvnodes` - `sysctl debug.freevnodes` - `sysctl debug.vnlru_nowhere` - `ps auxl | grep vnlru | grep -v grep | awk '{print $20}'`
:  sleep 10
:end
:
:which outputs this:
:
:debug.numvnodes: 463421 - debug.freevnodes: 220349 - debug.vnlru_nowhere: 3 - vlruwt
:
:I have my maxvnodes set to 512k right now ... now, when the server hung,
:the output would look something like (this would be with 'default' vnodes):
:
:debug.numvnodes: 199252 - debug.freevnodes: 23 - debug.vnlru_nowhere: 12 - vlrup
:
:with the critical bit being the vlruwt - vlrup change ...
:
:with unionfs, you are using two vnodes per file, instead of one in
:non-union mode, which is why I went to 512k vs the default of ~256k vnodes
:... it doesn't *fix* the problem, it only reduces its occurance ...
:___
:[EMAIL PROTECTED] mailing list
:http://lists.freebsd.org/mailman/listinfo/freebsd-stable
:To unsubscribe, send any mail to [EMAIL PROTECTED]
:

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: system slowdown - vnode related

2003-05-27 Thread Matthew Dillon

:I'll try this if I can tickle the bug again.
:
:I may have just run out of freevnodes - I only have about 1-2000 free
:right now.  I was just surprised because I have never seen a reference
:to tuning this sysctl.
:
:- Mike H.

The vnode subsystem is *VERY* sensitive to running out of KVM, meaning
that setting too high a kern.maxvnodes value is virtually guarenteed to
lockup the system under certain circumstances.  If you can reliably
reproduce the lockup with maxvnodes set fairly low (e.g. less then
100,000) then it ought to be easier to track the deadlock down.

Historically speaking systems did not have enough physical memory to
actually run out of vnodes.. they would run out of physical memory
first which would cause VM pages to be reused and their underlying
vnodes deallocated when the last page went away.  Hence the amount of
KVM being used to manage vnodes (vnode and inode structures) was kept
under control.

But today's Intel systems have far more physical memory relative to
available KVM and it is possible for the vnode management to run
out of KVM before the VM system runs out of physical memory.

The vnlru kernel thread is an attempt to control this problem but it
has had only mixed success in complex vnode management situations 
like unionfs where an operation on a vnode may cause accesses to
additional underlying vnodes.  In otherwords, vnlru can potentially
shoot itself in the foot in such situations while trying to flush out
vnodes.

-Matt

___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: 4.7-R-p3: j.root-servers.net

2003-01-26 Thread Matthew Dillon

:On Sun, 26 Jan 2003, at 15:55 [=GMT-0800], Matthew Dillon wrote:
:
: #set hostname = 'ftp.alternic.net'
: #set remfile = 'db.root'
: #set locfile = 'db.root'
: set hostname = 'ftp.rs.internic.net'
: set remfile = domain/root.zone.gz
: set locfile = root.zone.gz
:
:Did you at some time change your root?

That's just old commented out stuff.  Probably from quite a long time
ago, I've been using some variation of this script for almost 10 years.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-stable in the body of the message



Re: question in residue in umass.c

2002-12-19 Thread Matthew Dillon

:I recently got my TREK Thumb drive working with -current. I was hoping to
:make similar changes to -STABLE to get it working. However, in -current,
:one of the quirks that had to be set was IGNORE_RESIDUE. There does not
:seem to be any corresponding option in -STABLE. 
:
:If there is, could one of the USB Wizards get back to me on how to make
:the appropriate change? The -CURRENT settings are below, as a reference.
:
:   -Brian
:
:  { USB_VENDOR_TREK, USB_PRODUCT_TREK_THUMBDRIVE_8MB, RID_WILDCARD,
:UMASS_PROTO_ATAPI | UMASS_PROTO_BBB,
:IGNORE_RESIDUE
:  },

MFCing the quirk support, or at least most of it, is not that 
difficult and I will be doing it after my -current patches get
cleared up.

In the mean time you can use this patch for -stable.  Just add a quirk
entry for your TREK on top of this patch and there is a good chance
it will work.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

Index: cam/scsi/scsi_da.c
===
RCS file: /home/ncvs/src/sys/cam/scsi/scsi_da.c,v
retrieving revision 1.42.2.29
diff -u -r1.42.2.29 scsi_da.c
--- cam/scsi/scsi_da.c	23 Nov 2002 23:21:42 -0000	1.42.2.29
+++ cam/scsi/scsi_da.c	19 Dec 2002 18:01:19 -0000
@@ -250,6 +250,16 @@
},
{
/*
+* Sony Key-Storage media fails in terrible ways without
+* both quirks.  The auto 6-10 code doesn't do the job.
+* (note: The Sony diskkey is actually the MSYSTEMS 
+* disk-on-key device).
+*/
{T_DIRECT, SIP_MEDIA_REMOVABLE, "Sony", "Storage Media", "*"},
+   /*quirks*/ DA_Q_NO_6_BYTE|DA_Q_NO_SYNC_CACHE
+   },
+   {
+   /*
 * Sony DSC cameras (DSC-S30, DSC-S50, DSC-S70)
 */
{T_DIRECT, SIP_MEDIA_REMOVABLE, "Sony", "Sony DSC", "*"},
Index: dev/usb/ohci.c
===
RCS file: /home/ncvs/src/sys/dev/usb/ohci.c,v
retrieving revision 1.39.2.7
diff -u -r1.39.2.7 ohci.c
--- dev/usb/ohci.c	6 Nov 2002 20:23:50 -0000	1.39.2.7
+++ dev/usb/ohci.c	20 Dec 2002 01:21:12 -0000
@@ -469,7 +469,7 @@
 
cur = std;
 
-   dataphysend = DMAADDR(dma, len - 1);
+   dataphysend = OHCI_PAGE(DMAADDR(dma, len - 1));
tdflags = 
(rd ? OHCI_TD_IN : OHCI_TD_OUT) | 
OHCI_TD_NOCC | OHCI_TD_TOGGLE_CARRY | 
@@ -484,8 +484,8 @@
 
/* The OHCI hardware can handle at most one page crossing. */
 #if defined(__NetBSD__) || defined(__OpenBSD__)
-   if (OHCI_PAGE(dataphys) == OHCI_PAGE(dataphysend) ||
-   OHCI_PAGE(dataphys) + OHCI_PAGE_SIZE == OHCI_PAGE(dataphysend))
+   if (OHCI_PAGE(dataphys) == dataphysend ||
+   OHCI_PAGE(dataphys) + OHCI_PAGE_SIZE == dataphysend)
 #elif defined(__FreeBSD__)
/* XXX This is pretty broken: Because we do not allocate
 * a contiguous buffer (contiguous in physical pages) we
@@ -493,7 +493,7 @@
 * So check whether the start and end of the buffer are on
 * the same page.
 */
-   if (OHCI_PAGE(dataphys) == OHCI_PAGE(dataphysend))
+   if (OHCI_PAGE(dataphys) == dataphysend)
 #endif
{
/* we can handle it in this TD */
@@ -510,6 +510,8 @@
/* must use multiple TDs, fill as much as possible. */
curlen = 2 * OHCI_PAGE_SIZE - 
 OHCI_PAGE_MASK(dataphys);
+   if (curlen > len)   /* may have fit in one page */
+   curlen = len;
 #elif defined(__FreeBSD__)
/* See comment above (XXX) */
curlen = OHCI_PAGE_SIZE - 
Index: dev/usb/umass.c
===
RCS file: /home/ncvs/src/sys/dev/usb/umass.c,v
retrieving revision 1.11.2.13
diff -u -r1.11.2.13 umass.c
--- dev/usb/umass.c	21 Nov 2002 21:26:14 -0000	1.11.2.13
+++ dev/usb/umass.c	20 Dec 2002 01:21:43 -0000
@@ -1488,6 +1488,7 @@
panic("%s: transferred %d bytes instead of %d bytes\n",
USBDEVNAME(sc->sc_dev),
sc->transfer_actlen, sc->transfer_datalen);
+#if 0
} else if (sc->transfer_datalen - sc->transfer_actlen
   != UGETDW(sc->csw.dCSWDataResidue)) {
DPRINTF(UDMASS_BBB, ("%s: actlen=%d != residue=%d\n",
@@ -1497,6 +1498,7 @@
 
umass_bbb_reset(sc, STATUS_WIRE_FAILED);
return;
+#endif
 
} else if (sc->csw.bCSWStatus

Re: Kernel Panics in 4.7-STABLE

2002-10-13 Thread Matthew Dillon

The nexus_print_all_resources() panic is due to a bug in EISA bus
handling that shows up due to a recent commit John made.
He has a tentitive patch for it but it needs to be tested / verified.

I've included it below.  Pelase try this patch and tell us if it
fixes it.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

:
:nexus_print_all_resources(c0e62280,c0e4a680,c0e62280,c0e62280,0) at 
:nexus_print_all_resources+0x14
:...


Index: nexus.c
===
RCS file: /usr/cvs/src/sys/i386/i386/nexus.c,v
retrieving revision 1.26.2.6
diff -u -r1.26.2.6 nexus.c
--- nexus.c	3 Mar 2002 05:42:49 -0000	1.26.2.6
+++ nexus.c	11 Oct 2002 18:07:45 -0000
@@ -219,21 +219,21 @@
 * connection points now so they show up on motherboard.
 */
if (!devclass_get_device(devclass_find("eisa"), 0)) {
-   child = device_add_child(dev, "eisa", 0);
+   child = BUS_ADD_CHILD(dev, 0, "eisa", 0);
if (child == NULL)
panic("nexus_attach eisa");
device_probe_and_attach(child);
}
 #if NMCA > 0
if (!devclass_get_device(devclass_find("mca"), 0)) {
-   child = device_add_child(dev, "mca", 0);
-   if (child == 0)
+   child = BUS_ADD_CHILD(dev, 0, "mca", 0);
+   if (child == NULL)
panic("nexus_probe mca");
device_probe_and_attach(child);
}
 #endif
if (!devclass_get_device(devclass_find("isa"), 0)) {
-   child = device_add_child(dev, "isa", 0);
+   child = BUS_ADD_CHILD(dev, 0, "isa", 0);
if (child == NULL)
panic("nexus_attach isa");
device_probe_and_attach(child);

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-stable in the body of the message



Re: Kernel Panics in 4.7-STABLE

2002-10-13 Thread Matthew Dillon

Oh, also, just a general note to people.  These bug reports are great,
it tells us that something went wrong, but it would be nice if they
were a little more complete :-)  For example, it wasn't until the 20th
posting in this and the similar other kernel panic thread on -hackers
that someone actually posted the console text (or DDB backtrace)
leading up to the crash, so we didn't connect it with the NEXUS issue
until just now.

-Matt

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-stable in the body of the message



Re: Setup routing entry for host with a non-local IP address

2002-10-09 Thread Matthew Dillon


: fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
: inet 216.240.41.17 netmask 0xffffffc0 broadcast 216.240.41.63
: inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
: inet 216.240.41.21 netmask 0xffffffff broadcast 216.240.41.21
:
:That's what I said..  However, I would never use the above setup if
:it's supposed to be secure. Anyone with access to a machine in the
:41.1-41.62 range would be able to sniff the 10-net, which would not
:like. (maybe your setup allows for this, but I wouldn't mind the cost
:of a $6 el-cheapo NIC and a crosscable to get more secure, it's even
:cheaper than the time spend typing this mail ;-) ).

   Uhh.  I don't see how this can possibly make things more secure.  If
   the machine needs to be on both nets and someone breaks root on it,
   having a second NIC isn't going to save you.

:But in the case of two physical interfaces on the same (physical)
:segment, you get ARP errors. With aliases, you don't.
:
:Regards,
:
:Paul 

ARP errors?  Only if you try to configure the same IP address on
the two interfaces.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-stable in the body of the message



Re: [JUPITER] Fatal trap 12: page fault while in kernel mode (Was:Re: Woo hoo ... it crashed!! )

2002-09-05 Thread Matthew Dillon

(adding the general list back in!)

:
:On Thu, 5 Sep 2002, Matthew Dillon wrote:
:
: Whew!  These are big! :-).  I've got jupiter's files downloaded, now
: it's working on venus.
:
:Ya, both are 4gig servers :)  That was why netdump was so critical, cause
:they are also both production servers, so choking it back at 2gig *really*
:hurt ;)
:
   
Ok, you are running out of KVM!  Yes indeed, that is what is happening.
It must be happening quickly or that while loop diagnostic you did
would have caught it.

(kgdb) print kernel_vm_end
$77 = 0xff400000

I'm not entirely sure but I believe SMP boxes reserve more page table
pages than non-SMP boxes (e.g. an extra segment or two, and each segment
represents 4MB of VM).  So this could be hitting the limit.

I'm going to dump a bunch of statistics first, then I'll analyize them:

(kgdb) zlist
0xc943e780  NFSNODE            0 init + 56109152 dyn =  56109152
0xc943e800  NFSMOUNT           0 init +    83776 dyn =     83776
0xc92db580  PIPE               0 init +   799680 dyn =    799680
0xc92b4a80  SWAPMETA    37282560 init +  1044480 dyn =  38327040
0xc92b4f80  unpcb              0 init +       40 dyn =        40
0xc9254000  ripcb        2949120 init +     8064 dyn =   2957184
0xc9254080  syncache     2457440 init +    16320 dyn =   2473760
0xc9254100  tcpcb        8355840 init +  1114112 dyn =   9469952
0xc9254180  udpcb        2949120 init +    81792 dyn =   3030912
0xc9254200  socket       2949120 init +   794496 dyn =   3743616
0xc9254280  DIRHASH            0 init +  2007040 dyn =   2007040
0xc9254300  KNOTE              0 init +    12288 dyn =     12288
0xc9032d00  VNODE              0 init + 45344256 dyn =  45344256
0xc9032d80  NAMEI              0 init +   139264 dyn =    139264
0xc6302a80  VMSPACE            0 init +   700416 dyn =    700416
0xc6302b00  PROC               0 init +  1528800 dyn =   1528800
0xc0228e40  DP fakepg          0 init +        0 dyn =         0
0xc0239b40  PV ENTRY    92320648 init + 28901124 dyn = 121221772
0xc0228fc0  MAP ENTRY          0 init +  5334624 dyn =   5334624
0xc0228f60  KMAP ENTRY  12180000 init +   673776 dyn =  12853776
0xc0229020  MAP                0 init +     1080 dyn =      1080
0xc022c700  VM OBJECT          0 init + 26998656 dyn =  26998656
TOTAL ZONE KMEM RESERVED: 132546560 init + 147156992 dynamic = 279703552

Memory statistics by type                                Type  Kern
        Type  InUse MemUse HighUse   Limit  Requests Limit Limit Size(s)
       linux      7     1K      1K 102400K         7     0     0  32
    NFS hash      1   512K    512K 102400K         1     0     0  512K
 NQNFS Lease      1     1K      1K 102400K         1     0     0  1K
NFSV3 srvdesc    28     1K      4K 102400K 314484627     0     0  16,256
NFSV3 diroff    145    73K    355K 102400K     41817     0     0  512
  NFS daemon     69     7K      7K 102400K        69     0     0  64,256,512
     NFS req      1     1K      3K 102400K 157377573     0     0  64
 NFS srvsock      1     1K      1K 102400K         1     0     0  256
    atkbddev      2     1K      1K 102400K         2     0     0  32
     memdesc      1     4K      4K 102400K         1     0     0  4K
        mbuf      1    24K     24K 102400K         1     0     0  32K
      isadev      4     1K      1K 102400K         4     0     0  64
        ZONE     16     2K      2K 102400K        16     0     0  128
   VM pgdata      1   128K    128K 102400K         1     0     0  128K
   file desc   3174   913K   1280K 102400K    503534     0     0  256,512,1K,2K,4K,8K
 UFS dirhash   1593   636K   1158K 102400K    223440     0     0  16,32,64,128,256,512,1K,4K,8K
   UFS mount     15    59K     59K 102400K        15     0     0  512,2K,8K,32K
   UFS ihash      1   512K    512K 102400K         1     0     0  512K
    FFS node 210654 52664K  58024K 102400K 122698088     0     0  256
      dirrem    130     5K    952K 102400K    480890     0     0  32
       mkdir      0     0K      7K 102400K      1612     0     0  32
      diradd    130     5K    101K 102400K    418137     0     0  32
    freefile     62     2K    727K 102400K    248861     0     0  32
    freeblks     74    10K   2493K 102400K    213310     0     0  128
    freefrag      6     1K     23K 102400K    142018     0     0  32
  allocindir      1     1K    289K 102400K    441331     0     0  64
    indirdep      2     1K     81K 102400K     13249     0     0  32,8K
 allocdirect     24     2K    124K 102400K    338310     0     0  64
   bmsafemap     26     1K      5K 102400K    204068     0     0  32
      newblk      1     1K      1K 102400K    779642     0     0  32,256
    inodedep    262   545K   4055K 102400K    601135     0     0  128,512K
     pagedep    175   139K    277K 102400K    298594     0     0  64,128K
    p1003.1b      1     1K      1K 102400K         1     0     0  16
    syncache      1     8K

Re: squid and datasize kernel problems (was: openoffice stack and datasize kernel problems)

2002-07-29 Thread Matthew Dillon

:...
: It should be noted that mmap() uses whatever VM space
: remains after MAXDSIZ and MAXSSIZ have been reserved, so
: increasing MAXDSIZ reduces the amount of VM available
: to mmap().  Still, a 1 GB MAXDSIZ should not result in
: system utilities / servers running out of mmap() space! 
: Userland has 3G of VM space to play with.
:
:I'm running into some similar issues when trying to make Squid eat as
:much as possible of the 4 GB memory I have installed in a Compaq
:ProLiant DL 380 G2. At first, Squid seems to die and restart when trying
:to allocate memory above 512 MB. By tuning MAXDSIZ, I have made it use
:up to around 2 GB. If I set MAXDSIZ (I now do it in loader.conf with
:kern.maxdsiz) above around 2950 MB, init starts failing upon boot:
:
:init in malloc(): error mmap(2) failed, check limits
:init in malloc(): warning: recursive call
:
:Does anyone have any clues on how to overcome this? I'll be trying out
:the the dlmalloc library that is distributed with Squid, but I suppose
:I do need a 4 GB maximum data size to be able to make Squid actually use
:4 GB. Is this possible, or am I being totally foolish?
:
:Any hint very appreciated. :-)
:
:Cheers,
:
:-- 
:Anders.

0xFFFFFFFF  +---------------------------+-----
            |                           |
            |          KERNEL           | (1G)
            |                           |
0xC0000000  +---------------------------+-----
            |                           |
            |        USER STACK         |
            |             |             |
            |             V             |
            |                           |
            +---------------------------+
            |                           |
            |                           |
            |    AVAILABLE FOR MMAP     |
            |                           |  (stack, mmap, user data, program) = 3G
    MAXDSIZ +---------------------------+
            |                           |
            |   USER DATA (NON-MMAP)    |
            |                           |
            +---------------------------+
            |                           |
            |      PROGRAM BINARY       |
            |                           |
0x00000000  +---------------------------+-----

Any C program which uses shared libraries uses mmap().  Many library
functions and libraries also use mmap(), including portions of our 
malloc() implementation (though the main area used by mmap is the
user data area).  If you increase MAXDSIZ to the point where there
is not enough VM left for the mmap()'s that programs make, then you will
run into the problems you are having.

I'm not sure why you are trying to have squid use all 4G of the
machine directly in its user data area.  Squid caches a lot of
things in memory, sure, but it also caches things in files and
FreeBSD will use free physical memory to cache those files
regardless of how you configure the machine.  So you should be
getting good utilization of your 4G of memory even if Squid cannot
use all 4G in its user data area directly.   Setting MAXDSIZ to
2.9GB out of the 3G of user VM available puts a huge squeeze on
how much the program can mmap() before it runs out of VM.
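
Run the numbers: with kern.maxdsiz at 2950MB out of the ~3072MB of
user VM, only ~122MB remains for the program binary, the stack, the
shared libraries, and every other mmap() -- which is why init's
malloc()/mmap() starts failing at boot.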

-Matt
Matthew Dillon 
[EMAIL PROTECTED]



To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-stable in the body of the message



Re: Abominable NFSv3 read performance / FreeBSD server / Solaris client

2002-07-23 Thread Matthew Dillon


:I'm experiencing terrible NFSv3 read performance between Solaris 8
:clients and a FreeBSD 4.6-STABLE server, which ran CVSup/make world
:a day ago and thus includes the 'em' driver update (though that made
:no difference).
:
:Other FreeBSD clients and Mac G3 boxes running MacOS X 10.1.x can do
:reads at fairly respectable rate - several MB/s over 100 Mbps ethernet.

Hi Olaf.

:And now a (partial) tcpdump trace of the 'dd if=foo of=/dev/null bs=64k'
:from the Sun's perspective.
:...
:11:50:14.388682 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.388769 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.388786 solaris.1022 > freebsd.nfsd: . ack 30660 win 24820 (DF)
(HERE)
:11:50:14.480613 solaris.1022 > freebsd.nfsd: . ack 32120 win 24820 (DF)
:11:50:14.480892 freebsd.nfs > solaris.0: reply ok 780 (DF)
...
:11:50:14.482482 solaris.3454009260 > freebsd.nfs: 180 read fh 979,451513/13278828 
:32768 bytes @ 0x10000 (DF)
:11:50:14.482501 solaris.3454009261 > freebsd.nfs: 180 read fh 979,451513/13278828 
:32768 bytes @ 0x18000 (DF)
:11:50:14.482520 solaris.3454009262 > freebsd.nfs: 180 read fh 979,451513/13278828 
:32768 bytes @ 0x20000 (DF)
:11:50:14.482539 solaris.3454009263 > freebsd.nfs: 180 read fh 979,451513/13278828 
:32768 bytes @ 0x28000 (DF)

This isn't right.  A whole 1/10 second delay.

I'm going to point out something here.  Note that the
last packet FreeBSD sends is 780 bytes.  This corresponds
approximately to the last fragment of an 8K NFS read
operation.   Note that FreeBSD appears to be waiting
for an ACK from the solaris box before pushing the
last packet out.

This tells me that there is a serious window sizing
issue here.  Unfortunately the TCP trace is not detailed
enough to tell for sure, I really need to see the sequence
numbers for FreeBSD's transmissions as well as Solaris's
acks's, but there are several possibilities:

(a) The FreeBSD box does not have a large enough send
buffer.

(b) The Solaris box does not have a large enough receive
buffer.

or

(c) The buffers are large enough but the solaris box is
not properly handling RFC1323 (window scaling).

Please try the following:

(1) Disable rfc1323 (net.inet.tcp.rfc1323) and change
TCP's send buffer size to 65535 bytes (NOT 65536),
aka net.inet.tcp.sendspace=65535.  

Remember to killall -9 nfsd and restart nfsd on
the FreeBSD box after making these changes.

(2) Try specifying larger net.inet.tcp.sendspace's
with window scaling disabled.  Again remember
to killall -9 nfsd and restart it.

(3) If any of the above fixed the problem, try reenabling
window scaling (and restart nfsd's again).  If the
problem now occurs again the issue is that Solaris
is likely not doing window scaling properly.

Also during the dd, on the FreeBSD side please do a
'netstat -tn | fgrep tcp | fgrep the_proper_port' to
double check that FreeBSD is filling the TCP connection's
buffer as you expect for what you've set the buffer size
too.

The problem are these 1/10 second delays.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
   
...
:11:50:14.485255 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.485378 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.485510 freebsd.nfs > solaris.0: reply ok 1460 (DF)
(HERE)
:11:50:14.580651 solaris.1022 > freebsd.nfsd: . ack 65020 win 24820 (DF)
:11:50:14.580910 freebsd.nfs > solaris.0: reply ok 780 (DF)
(HERE)
:11:50:14.680615 solaris.1022 > freebsd.nfsd: . ack 65800 win 24820 (DF)
...
:11:50:14.680974 freebsd.nfs > solaris.3454009260: reply ok 1460 read (DF)
:11:50:14.681089 freebsd.nfs > solaris.0: reply ok 1460 (DF)

 Another 1/10 second delay.  In fact two sets of 1/10 second
 delays.


:11:50:14.682917 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.683038 freebsd.nfs > solaris.0: reply ok 1460 (DF)
(HERE)
:11:50:14.780637 solaris.1022 > freebsd.nfsd: . ack 97920 win 24820 (DF)
:11:50:14.780918 freebsd.nfs > solaris.0: reply ok 780 (DF)
(HERE)
:11:50:14.880647 solaris.1022 > freebsd.nfsd: . ack 98700 win 24820 (DF)
:11:50:14.880994 freebsd.nfs > solaris.3454009261: reply ok 1460 read (DF)
:11:50:14.881110 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.881134 solaris.1022 > freebsd.nfsd: . ack 101620 win 24820 (DF)
...

And again.

...
:11:50:14.883416 freebsd.nfs > solaris.0: reply ok 1460 (DF)
:11:50:14.883543 freebsd.nfs > solaris.0: reply ok 1460 (DF)
(HERE)
:11:50:14.980685 solaris.1022 > freebsd.nfsd: . ack 130820 win 24820 (DF)
:11:50:14.980965 freebsd.nfs > solaris.0: reply ok 780 (DF)
(HERE)
:11:50:15.080663 solaris.1022 > freebsd.nfsd: . ack 131600 win 24820 (DF)
:11:50:15.081021 freebsd.nfs > solaris.3454009262: reply ok 1460 read (DF
