Madhavan, overall this is a very nice piece of work!
My comments for the design doc are marked by // in the first two columns:
Callout Scalability Design Document
-----------------------------------
3.8.1 ID bits
-------------
In the new design, each CPU will have its own callout tables. This means
that the number of bits needed to encode the table number will grow with
the number of CPUs, which reduces the number of actual ID bits on larger
configurations. The potential issue here is that the fewer ID bits there
are, the faster the IDs will wrap. This is not an issue on 64-bit systems
because a 64-bit ID leaves plenty of ID bits.
// Can you explain how these bit allocations are determined? Do you use the
// maximum number of CPUs at boot time or the actual number of CPUs? Do you
// perform any rounding on it? With current CPU designs the number of CPUs
// can be quite high - is that an issue?
//
// Also, what happens if IDs do wrap? Will it slow down execution or cause
// incorrect or missed timer firings?
For 32-bit systems, here is a SWAG: 3 bits are used for special flags
in the ID.
// What is a SWAG?
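As a rough illustration of the trade-off, here is how a 32-bit ID might be
partitioned (the field widths and macro names below are examples only, not
the actual layout):

/*
 * Hypothetical 32-bit ID layout: table number in the high bits, a
 * per-table counter in the middle, 3 flag bits at the bottom.
 */
#define	CALLOUT_FLAG_BITS	3
#define	CALLOUT_TABLE_BITS	5	/* e.g. 32 callout tables */
#define	CALLOUT_ID_BITS		(32 - CALLOUT_TABLE_BITS - CALLOUT_FLAG_BITS)
#define	CALLOUT_ID_MASK		((1U << CALLOUT_ID_BITS) - 1)

/* Build an ID from a table number and that table's running counter. */
#define	CALLOUT_MAKE_ID(table, counter)					\
	(((uint32_t)(table) << (CALLOUT_ID_BITS + CALLOUT_FLAG_BITS)) |	\
	(((counter) & CALLOUT_ID_MASK) << CALLOUT_FLAG_BITS))

With only 24 counter bits in this example, a busy table will wrap its IDs
much sooner than in the 64-bit case.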
3.9 Expiration time
-------------------
Since we will use the cyclic subsystem directly, we need to use hrtime for
expiration (not ticks). But what gets passed in as the interval to
timeout() and realtime_timeout() is ticks. The conversion from ticks
to hrtime involves a multiplication and the conversion from hrtime to ticks
involves a division. Multiplication and division operations have become
a lot cheaper on modern processors, so this is not a performance concern.
// Multiplication is probably not a concern. Division may be more of an issue.
// Do you need to perform a lot of hrtime->ticks conversions?
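A minimal sketch of the two conversions, assuming hz is the system clock
tick rate (the helper names are illustrative, not the actual implementation):

extern int hz;				/* system clock tick rate */

static hrtime_t
ticks_to_hrtime(clock_t ticks)
{
	return ((hrtime_t)ticks * (NANOSEC / hz));	/* multiplication */
}

static clock_t
hrtime_to_ticks(hrtime_t ns)
{
	return ((clock_t)(ns / (NANOSEC / hz)));	/* division */
}

The multiplication happens once per timeout()/realtime_timeout() call; the
division is only needed on paths such as the untimeout() return value.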
3.11 DR/CPU offline/online
--------------------------
No special processing is needed for CPU offline.
// What resources are consumed by the offlined CPU in the new design? Are there
// taskq threads hanging around for it? In case of DR a CPU may be physically
// removed from the system - is anything special done in this case?
3.12 hrtime and lbolt
---------------------
The current design uses the lbolt to drive callout expiration.
The new design will use hrtime and the cyclic subsystem. This
has ramifications for situations like dropping into KMDB (or OBP).
When the system drops into KMDB and comes back, the current
design just continues from where it left off with
lbolt updates and callout scheduling. The new design, however,
will expire all the timeouts that expired while the system was in
the debugger. This is a fundamental behavior change. We need to test
with network workloads to see if this becomes an issue with
TCP retransmission.
// How would this affect the semantics of the deadman timer? Can it cause the
// deadman timer to fire after a KMDB session? Also, this is not only a KMDB
// issue; it becomes much more important when Solaris runs in a virtualized
// environment where virtual machines can be suspended, resumed, etc. For
// networking connections it is probably OK to expire timers normally, but
// other cases (like the deadman timer) are more interesting.
3.13 New Callout API
--------------------
// Are these new APIs going to be included in the DDI or they will remain ON
// private?
Two new callout API functions will be implemented in the new design.
timeout_id_t
timeout_generic(int type, void (*func)(void *), void *arg,
    hrtime_t expiration, hrtime_t resolution, int flags)
timeout() and realtime_timeout() just call this internally.
Arguments:
type CALLOUT_REALTIME or CALLOUT_NORMAL
This specifies the type of timeout.
// I would suggest choosing names with TIMEOUT in them rather than CALLOUT. The
// interface deals with timeouts, not callouts. The same applies to the flags.
CALLOUT_FLAG_ROUNDUP. This specifies that the expiration should be
rounded up to the nearest resolution boundary. Without this, the
expiration is rounded down.
// Are there consumers who require both types of rounding? Who needs round downs
// and who needs round ups?
// Also, why not delegate the rounding to the caller and avoid any rounding in
// timeout_generic()? The caller may easily decide the granularity needed.
CALLOUT_FLAG_HRESTIME. This flag creates a callout that is affected
by hrestime (the system date/time). When someone changes the system
time (e.g., via stime()), these callouts are to expire immediately
and return to their caller. These are used only by cv_waituntil_sig().
// Why is it called FLAG_HRESTIME? The name is misleading.
// BTW, why does cv_waituntil_sig expect its timeout to expire in this case?
hrtime_t
untimeout_generic(timeout_id_t)
untimeout() just calls this internally and converts hrtime to ticks.
// So the only difference is the return value, right?
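A hedged usage sketch of the two new interfaces (this assumes expiration
and resolution are expressed in nanoseconds and that expiration is an
interval relative to the current time; the callback and constants are
illustrative):

static void
my_expire(void *arg)
{
	/* timeout handler body */
}

void
example(void *arg)
{
	timeout_id_t id;

	/* Fire about 10ms from now, rounded up to a 1ms boundary. */
	id = timeout_generic(CALLOUT_NORMAL, my_expire, arg,
	    10000000LL /* 10ms */, 1000000LL /* 1ms */,
	    CALLOUT_FLAG_ROUNDUP);

	/* Cancel it; the return value is the time remaining, in hrtime. */
	(void) untimeout_generic(id);
}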
3.14 Cyclic subsystem changes
-----------------------------
We need changes to the cyclic subsystem API to accommodate the
new callout design.
- We need a new API function to reprogram cyclics. When applied to
a regular cyclic, it will find the CPU to which the cyclic is
currently attached and send an X-call there to accomplish the
reprogramming. When applied to an omni-cyclic, it will reprogram
the current CPU's component of the omni-cyclic. To avoid
the overhead of X-calls, the caller should be on the same
CPU for a regular cyclic.
// It is not clear whether the caller should be on the same CPU for performance
// reasons, or whether the caller must be on the same CPU for correct behaviour
// (in which case the code may ASSERT that it is on the same CPU).
- We need a cyclic handler to have the ability to reprogram its cyclic.
The cyclic handler could just call the new API. However, the
cyclic subsystem documentation says that cyclic handlers should
never call the cyclic API functions. We can approach this in one
of two ways:
1. Call the new reprogram API from the handler. The reprogram API
does not use the CPU lock; it ensures safety by raising the
PIL to HILEVEL. So I believe it is safe to call this function
from the handlers, and the documentation should be changed to
indicate this.
// Is it sufficient to raise the PIL? This document doesn't provide enough
// background on synchronisation used by the callout/cyclics.
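A sketch of option 1, with the caveat that the reprogram entry point is not
named in this document; cyclic_reprogram() and the table/field names below
are assumptions:

static void
callout_cyclic_handler(void *arg)
{
	callout_table_t *ct = arg;	/* hypothetical per-table state */
	hrtime_t next;

	/* Expire whatever is due and learn the next expiration time. */
	next = callout_table_expire(ct);

	/*
	 * Called from the handler itself; safe because the reprogram
	 * code raises the PIL to HILEVEL instead of taking the CPU lock.
	 */
	if (next != 0)
		(void) cyclic_reprogram(ct->ct_cyclic, next);
}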
3.15 Observability changes
--------------------------
// Really nice additions!
3.15.1 MDB dcmd callout
-----------------------
The current design supports the mdb dcmd "callout". This simply dumps
all the callouts in the system and is not very useful.
Changes will be made to the dcmd to:
- reflect data structure changes
- provide many meaningful options to print just the interesting data
Here is a description:
NAME
callout - display callouts
SYNOPSIS
[ addr ] ::callout [-r|n] [-s|l] [-x|u] [-t|a|b tick [-d]]
    [-C | -S seqid] [-f name|addr] [-T|L [-E]] [-FivV]
// Is it possible/easy to also limit the output by a specific argument passed
// to timeout()? Is it possible to apply several filtering options at once?
3.15.2 New MDB dcmd calloutid
-----------------------------
A new dcmd calloutid has been introduced to dump the state of a callout,
given its ID.
Here is a description:
NAME
calloutid - print callout by extended id
SYNOPSIS
::calloutid [-d|v] xid
// I'd suggest doing it in the normal MDB style -
// xid ::calloutid
// This would allow the command to be used in a pipeline.
3.16.3 Per-Lgroup kmem_cache
----------------------------
The kmem caches will be created per hierarchical lgroup. For NUMA-like
systems such as Starcat, the memory for the callouts and callout lists
would then come from memory that is close to the CPUs, once the kmem
layer's locality awareness is improved. In addition, the cache pages would
only be mapped in the TLBs of those CPUs, and the data would only be
present in those CPUs' caches. This is futuristic.
// Can you explain how you achieve memory locality using per-lgroup kmem caches?
// What is allocated from these caches? How is it freed?
// Why do you need lgroup handle in the callout_cache_t structure?
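A rough sketch of what per-lgroup cache creation could look like; the
callout_cache_t layout, names, and sizes here are guesses based on the
comment above, not the actual design:

typedef struct callout_cache {
	lgrp_handle_t	cc_hand;	/* lgroup this cache serves */
	kmem_cache_t	*cc_cache;	/* callout structures */
	kmem_cache_t	*cc_lcache;	/* callout lists */
} callout_cache_t;

static void
callout_cache_init(callout_cache_t *cc, lgrp_handle_t hand, int i)
{
	char name[64];

	cc->cc_hand = hand;
	(void) snprintf(name, sizeof (name), "callout_cache%d", i);
	cc->cc_cache = kmem_cache_create(name, sizeof (callout_t), 0,
	    NULL, NULL, NULL, NULL, NULL, 0);
	(void) snprintf(name, sizeof (name), "callout_lcache%d", i);
	cc->cc_lcache = kmem_cache_create(name, sizeof (callout_list_t), 0,
	    NULL, NULL, NULL, NULL, NULL, 0);
}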
3.16.4 Callout table allocation
-------------------------------
Callout tables are allocated one at a time in the current design. In the
new design, they will be allocated in a single allocation. The advantage
is that if large pages are used for the kernel heap, the callout tables
will be allocated in large pages. This reduces the number of TLB
entries required to address the tables.
// How do you know how many tables to allocate? Do you allocate for all
// possible CPUs?
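A sketch of the single allocation, assuming one table per callout type
(realtime and normal) for every possible CPU; the names below are only
illustrative:

#define	CALLOUT_NTYPES	2	/* CALLOUT_REALTIME, CALLOUT_NORMAL */

callout_table_t	*callout_table;	/* base of all callout tables */

void
callout_init(void)
{
	size_t size = sizeof (callout_table_t) * CALLOUT_NTYPES * max_ncpus;

	/*
	 * One allocation for all tables; with a large-page kernel heap
	 * this lands in large pages and saves TLB entries.
	 */
	callout_table = kmem_zalloc(size, KM_SLEEP);
}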