git commit: Doc: Clean up in cache architecture, added information about stripe assignment initialization.

amc Wed, 18 Dec 2013 12:56:47 -0800

Updated Branches:
  refs/heads/master 2b052f35b -> 0c953e129


Doc: Clean up in cache architecture, added information about stripe assignment 
initialization.


Project: http://git-wip-us.apache.org/repos/asf/trafficserver/repo
Commit: http://git-wip-us.apache.org/repos/asf/trafficserver/commit/0c953e12
Tree: http://git-wip-us.apache.org/repos/asf/trafficserver/tree/0c953e12
Diff: http://git-wip-us.apache.org/repos/asf/trafficserver/diff/0c953e12

Branch: refs/heads/master
Commit: 0c953e129225ed6db59a97d293d26ecae3f22c92
Parents: 2b052f3
Author: Alan M. Carroll <a...@network-geographics.com>
Authored: Wed Dec 18 14:55:20 2013 -0600
Committer: Alan M. Carroll <a...@network-geographics.com>
Committed: Wed Dec 18 14:55:20 2013 -0600

----------------------------------------------------------------------
 doc/arch/cache/cache-arch.en.rst | 148 ++++++++++++++++++++++------------
 doc/glossary.en.rst              |  42 +++++++++-
 2 files changed, 139 insertions(+), 51 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0c953e12/doc/arch/cache/cache-arch.en.rst
----------------------------------------------------------------------
diff --git a/doc/arch/cache/cache-arch.en.rst b/doc/arch/cache/cache-arch.en.rst
index 0018c3a..44b5a75 100644
--- a/doc/arch/cache/cache-arch.en.rst
+++ b/doc/arch/cache/cache-arch.en.rst
@@ -25,9 +25,8 @@ Introduction
 
 In addition to an HTTP proxy, |ATS| is also an HTTP cache. |TS| can cache any 
octet stream although it currently
 supports only those octet streams delivered by the HTTP protocol. When such a 
stream is cached (along with the HTTP
-protocol headers) it is termed an *object* in the cache. Each object is 
identified by a globally unique value called a
-*cache key*. By default this is generated by taking the `MD5 hash 
<http://www.openssl.org/docs/crypto/md5.html>`_ of the
-URI used to retrieve the content from the origin server.
+protocol headers) it is termed an :term:`object <cache object>` in the cache. 
Each object is identified by a globally
+unique value called a :term:`cache key`.
 
 The purpose of this document is to describe the basic structure and 
implementation details of the |TS| cache.
 Configuration of the cache will be discussed only to the extent needed to 
understand the internal mechanisms. This
@@ -41,20 +40,28 @@ different ways than they are used in the code in an attempt 
to create some consi
 Cache Layout
 ~~~~~~~~~~~~
 
-The first step in understanding cache operations is to understand the data 
structures and layout of the cache.
+The following sections describe how persistent cache data is structured. |TS| 
treats its persisent storage an
+undifferentiated collection of bytes, assuming no other structure to it. In 
particular it does not use the file system
+of the host operating system. If a file is used it is used only to mark out 
the set of bytes to be used.
 
 Cache storage
 =============
 
-The raw storage for the |TS| cache is configured in :file:`storage.config`. 
Each line in the file defines a :term:`cache span` which is treated as an 
undifferentiated unit of storage.
+The raw storage for the |TS| cache is configured in :file:`storage.config`. 
Each line in the file defines a :term:`cache
+span` which is treated as a uniform persistent store.
 
 .. figure:: images/cache-spans.png
    :align: center
 
    Two cache spans
 
-This storage organized in to a set of :term:`cache volume`\ s which are 
defined in :file:`volume.config`. Cache volumes
-can be defined by a percentage of the total storage or an absolute amount of 
storage. By default each cache volume is spread across all of the cache spans 
for robustness. The intersection of a cache volume and a cache span is a 
:term:`cache stripe`. Each cache span is divided in to cache stripes and each 
cache volume is a collection of those stripes.
+This storage organized in to a set of :term:`cache volume`\ s which are 
defined in :file:`volume.config` for the
+purposes of the administrator. These are the units that used for all other 
administator level configuration.
+
+Cache volumes can be defined by a percentage of the total storage or an 
absolute amount of storage. By default each
+cache volume is spread across all of the cache spans for robustness. The 
intersection of a cache volume and a cache span
+is a :term:`cache stripe`. Each cache span is divided in to cache stripes and 
each cache volume is a collection of those
+stripes.
 
 If the cache volumes for the example cache spans were defined as
 
@@ -66,12 +73,12 @@ then the actual layout would look like
 .. image:: images/cache-span-layout.png
    :align: center
 
-A cached object is stored entirely in a single stripe, and therefore in a 
single cache span - objects are never split
-across cache volumes. Objects are assigned to a stripe (and hence to a cache 
volume) automatically based on a hash of
-the URI used to retrieve the object from the origin server. It is possible to 
configure this to a limited extent in
-:file:`hosting.config` which supports content from specific host or domain to 
be stored on specific volumes. In
-addition, as of version 4.0.1 it is possible to control which cache spans (and 
hence, which cache stripes) are contained
-in a specific cache volume.
+Cache stripes are the fundamental unit of cache for the implementation. A 
cached object is stored entirely in a single
+stripe, and therefore in a single cache span - objects are never split across 
cache spans or volumes. Objects are
+assigned to a stripe (and hence to a cache volume) automatically based on a 
hash of the URI used to retrieve the object
+from the origin server. It is possible to configure this to a limited extent 
in :file:`hosting.config` which supports
+content from specific host or domain to be stored on specific cache volumes. 
In addition, as of version 4.0.1 it is
+possible to control which cache spans (and hence, which cache stripes) are 
contained in a specific cache volume.
 
 The layout and structure of the cache spans, the cache volumes, and the cache 
stripes that compose them are derived
 entirely from the :file:`storage.config` and :file:`cache.config` and is 
recomputed from scratch when the
@@ -108,9 +115,9 @@ Content in a stripe is tracked via a directory. We call 
each element of the dire
 represented by :cpp:class:`Dir`. Each entry refers to a chunk of contiguous 
storage in the cache. These are referred to
 variously as "fragments", "segments", "docs" / "documents", and a few other 
things. This document will use the term
 "fragment" as that is the most common reference in the code. The term "Doc" 
(for :cpp:class:`Doc`) will be used to refer
-to the header data for a fragment. Overall the directory is treated as a hash 
with a "cache ID" as the key. A cache ID
-is a 128 bit (16 byte) value generated in various ways depending on context. 
This ID is reduced and used as an index in
-to the directory to locate an entry in the standard way.
+to the header data for a fragment. Overall the directory is treated as a hash 
with the :term:`cache ID` as the key. See
+:ref:`directory probing <cache-directory-probe>` for how the cache ID is used 
to locate a directory entry. The cache ID
+is in turn computed from a :term:`cache key` which by default is the URL of 
the content.
 
 The directory is used as a memory resident structure which means a directory 
entry is as small as possible (currently 10
 bytes). This forces some compromises on the data that can be stored there. On 
the other hand this means that most cache
@@ -126,44 +133,61 @@ there is enough memory to run |TS| with an empty cache 
there is enough to run it
    :align: center
 
 Each entry stores an offset in the stripe and a size. The size stored in the 
directory entry is an :ref:`approximate
-size <dir-size>` which is at least as big as the actual data. Exact size data 
is stored in the fragment header on disk.
+size <dir-size>` which is at least as big as the actual data in the fragment. 
Exact size data is stored in the fragment
+header on disk.
 
 .. note::
 
-   Data in HTTP headers cannot be examined without disk I/O. This includes the 
original URL for the object. The original
-   source of the cache ID is not stored anywhere.
+   Data in HTTP headers cannot be examined without disk I/O. This includes the 
original URL for the object. The cache
+   key is not stored explicitly and therefore cannot be reliably retrieved.
+
+The directory is a hash table that uses `chaining
+<http://en.wikibooks.org/wiki/Data_Structures/Hash_Tables#Collision_resolution>`_
 for collision resolution. Because each
+entry is small they are used directly as the list header of the hash bucket.
 
 .. _dir-segment:
 .. _dir-bucket:
 
-The entries in a directory are grouped. The first level grouping is a 
*bucket*. This is a fixed number (currently 4 -
-defined as ``DIR_DEPTH``) of entries. The index generated from a cache ID is 
used as a bucket index (not an entry index).
-Buckets are grouped in to *segments*. All segments in a stripe have the same 
number of buckets. The number of segments
-in a stripe is chosen so that each segment has as many buckets as possible 
without exceeding 65535 (2\ :sup:`16`\ -1)
-entries in a segment. Note that all segments in the same stripe will have the 
same number of buckets.
+Chaining is implemented by imposing grouping structures on the entries in a 
directory. The first level grouping is a
+:term:`directory bucket`. This is a fixed number (currently 4 - defined as 
``DIR_DEPTH``) of entries. This
+serves to define the basic hash buckets with the first entry in each cache 
bucket serving as the root of the hash
+bucket.
+
+.. note::
+
+   The term "bucket" is used in the code to mean both the conceptual bucket 
for hashing and for a structural grouping
+   mechanism in the directory and so these will be qualified as needed to 
distinguish them. The unqualified term
+   "bucket" is almost always used to mean the structural grouping in the 
directory.
+
+Directory buckets are grouped in to :term:`segments <directory segment>`. All 
segments in a stripe have the same number of
+buckets. The number of segments in a stripe is chosen so that each segment has 
as many buckets as possible without
+exceeding 65535 (2\ :sup:`16`\ -1) entries in a segment.
 
 .. figure:: images/dir-segment-bucket.png
    :align: center
 
-Each entry has a previous and next index value which is used to link the 
entries in the same segment. The index size is
-16 bits which suffices to index any entry in the same segment. The stripe 
header contains an array of entry indices
-which are used as the roots of a free list. When a stripe is initialized the 
first entry in each bucket is zeroed
-(marked unused) and all other entries are put in the corresponding segment 
free list rooted in the stripe header. In
-essence the first element of each fixed bucket is used as a root for that 
bucket. The other entries in the fixed bucker
-are preferentially preferred for adding to that bucket but this is not 
required. The segment free lists are initialized
-such that the extra bucket entries are added in order - all the seconds, then 
the thirds, then the fourths. Because the
-free lists are FIFOs this means extra entries will be selected from the fourth 
entries across all the buckets first,
-then the thirds, etc. This maximizes locality for bucket searching.
+Each directory entry has a previous and next index value which is used to link 
entries in the same segment. Because no
+segment has more than 65535 entries 16 bits suffices for storing the index 
values. The stripe header contains an array
+of entry indices which are used as the roots of entry free lists, one for each 
segment. Active entries are stored via
+the bucket structure. When a stripe is initialized the first entry in each 
bucket is zeroed (marked unused) and all
+other entries are put in the corresponding segment free list in the stripe 
header. This means the first entry of each
+directory bucket is used as the root of a hash bucket and is therefore marked 
unused rather than being put a free list.
+The other entries in the directory bucket are preferentially preferred for 
adding to the corresponding hash bucket but
+this is not required. The segment free lists are initialized such that the 
extra bucket entries are added in order - all
+the seconds, then the thirds, then the fourths. Because the free lists are 
FIFOs this means extra entries will be
+selected from the fourth entries across all the buckets first, then the 
thirds, etc. When allocating a new directory
+entry in a bucket the entries are searched from first to last, which maximizes 
bucket locality (that is, cache IDs that
+map to the same hash bucket will also tend to use the same directory bucket).
 
 .. figure:: images/dir-bucket-assign.png
    :align: center
 
 Entries are removed from the free list when used and returned when no longer 
in use. When a fragment needs to be put in
-to the directory the cache ID is used to locate a bucket (which also 
determines the segment). If the first entry in the
-bucket is marked unused, it is used. If not then the other entries in the 
bucket are searched and if any are on the free
-list, that entry is used. If none are available then the first entry on the 
segment free list is used. This entry is
-attached to the bucket via the same next and previous indices used for the 
free list so that it can be found when doing
-a lookup of a cache ID.
+to the directory the cache ID is used to locate a hash bucket (which also 
determines the segment and directory bucket).
+If the first entry in the directory bucket is marked unused, it is used. If 
not then the other entries in the bucket are
+searched and if any are on the free list, that entry is used. If none are 
available then the first entry on the segment
+free list is used. This entry is attached to the hash bucket via the same next 
and previous indices used for the free
+list so that it can be found when doing a lookup of a cache ID.
 
 Storage Layout
 --------------
@@ -435,10 +459,6 @@ maximum size that is at least as large as the fragment. 
That will indicate the v
 granularity of the approximate size. That represents the largest possible 
amount of wasted disk I/O when the fragment is
 read from disk.
 
-.. note:: The cache index entry size is used only for reading the fragment 
from disk. The actual size on disk, and the
-amount of cache space consumed, is the actual size of the content rounded up 
to the disk sector size (default 512
-bytes).
-
 .. index:: DIR_DEPTH, index segment, index buckets
 
 The set of index entries for a volume are grouped in to *segments*. The number 
of segments for an index is selected so
@@ -450,7 +470,7 @@ standard hash table way, giving somewhat less than 2^14 
buckets per segment.
 
 .. [#] The comment in :file:`records.config` is simply wrong.
 
-.. _dir-probe:
+.. _cache-directory-probe:
 
 Directory Probing
 -----------------
@@ -552,7 +572,7 @@ There are three basic steps to a cache lookup.
 
    The cache key is used as a hash key in to an array of :cpp:class:`Vol` 
instances. The construction and arrangement of this array is the essence of how 
volumes are assigned.
 
-#. The cache stripe directory :ref:`is probed <dir-probe>` using the index key 
computed from the cache key.
+#. The cache stripe directory :ref:`is probed <cache-directory-probe>` using 
the index key computed from the cache key.
 
    Various other lookaside directories are checked as well, such as the 
:ref:`aggregation buffer <aggregation-buffer>`.
 
@@ -744,7 +764,7 @@ This interacts with the "one at a time" strategy of the 
aggregation write logic.
    I do not understand the extra key list that is present in an evacuation 
block. It is labeled as needed for
    "collisions" but I am unclear on what might be colliding. The bucket 
entries are stored and matched by stripe offset
    but if two fragments collide on their offset, only one can be valid. Based 
on how :ref:`directory probing
-   <dir-probe>` works and the logic of :cpp:func:`evacuate_fragments()` it 
appears that rather than determine which
+   <cache-directory-probe>` works and the logic of 
:cpp:func:`evacuate_fragments()` it appears that rather than determine which
    entry in a directory bucket is the correct one, all of them are marked for 
evacuation (thereby handling
    "collisions"). However, each one could have a distinct fragment size and 
that is set for all of the reads by the
    first fragment found in the directory. The intent seems to be to read all 
fragments that collide at the same starting
@@ -774,17 +794,45 @@ basically four types,
 * Disk
 * Raw device
 
-After setting all the `Span` instances they are grouped by device id to 
internal linked lists attached to the
+After creating all the `Span` instances they are grouped by device id to 
internal linked lists attached to the
 :cpp:member:`Store::disk` array [#]_. Spans that refer to the same directory, 
disk, or raw device are coalesced in to a
-single span. Spans that refer to the same file with overlapping offsets are 
also coalesced [#]_. This is all done in :c:func:`ink_cache_init()` called 
during startup.
+single span. Spans that refer to the same file with overlapping offsets are 
also coalesced [#]_. This is all done in
+:c:func:`ink_cache_init()` called during startup.
+
+.. note:: The span logic is also used by the HostDB and more than one 
otherwise inexplicable feature is provided by the span logic for that module.
 
 After configuration initialization the cache processor is started by calling 
:ccp:func:`CacheProcessor::start()`. This
 does a number of things.
 
 For each valid span, an instance of :cpp:class:`CacheDisk` is created. This 
class is a continuation and so can be used
-to perform potentially blocking operations on the span. This what is passed to 
the AIO threads to be called when an I/O
-operation completes. These are then dispatched to AIO threads to perform 
storage unit initialization. After all of those
-have completed, the resulting storage is distributed across the volumes in 
:c:func:`cplist_reconfigure`. The :cpp:class:`CacheVol` instances are created 
at this time.
+to perform potentially blocking operations on the span. The primary use of 
these is to be passed to the AIO threads as
+the callback when an I/O operation completes. These are then dispatched to AIO 
threads to perform storage unit
+initialization. After all of those have completed, the resulting storage is 
distributed across the volumes in
+:c:func:`cplist_reconfigure`. The :cpp:class:`CacheVol` instances are created 
at this time.
+
+Cache stripe assignment setup is done once all stripes have initialized (that 
is, the stripe header information has been
+successfully read from disk for all stripes). The assignment information is 
stored as an array of indices. These are
+indices in to an array of stripes. Both the assignment and the stripe arrays 
are stored in an instance of
+:cpp:class:`CacheHostRecord`. Assignment initialization consists of populating 
the assignment array which is much larger
+than the stripe array.
+
+There is an instance of :cpp:class:`CacheHostRecord` for each line in 
:file:`hosting.config` and one "generic" record.
+For the configured instances the set of stripes is determined from the cache 
volume specified in the line. If no lines are specified all stripes are placed 
in the generic record, otherwise only those stripes marked as default are 
placed in the generic record.
+
+.. note:: If hosting records are specified it is an error to not specify at 
least one default cache volume.
+
+The assignment table is initialized in :c:func:`build_vol_hash_table` which is 
called for each
+:cpp:class:`CacheHostRecord` instance. For each strip in the host record a 
sequence of pseudo-random numbers is
+generated, starting with the folded hash of the stripe hash identifier, which 
is the device path followed by the skip
+and size values for that stripe, making it unique. This also makes the 
sequence deterministic for any particular stripe.
+Each stripe gets one number in its sequence for every `VOL_HASH_ALLOC_SIZE` (8 
MB currently) of storage. These numbers are paired with
+the stripe index, combined across all stripes, then sorted by the random 
values. The resulting array is sampled for
+every slot in the stripe assignment table by dividing the maximum random value 
by the size of the assignment table and
+using the value midway between each multiple of the result of the division. 
The coalesced psuedo-random sequence is
+scanned for each sample in turn and the first number not greater than the 
sample is found. The stripe associated with
+that value is used for that assignment table entry.
+
+While this procedure is determinstic it is sensitive to initial conditions, 
including the size of each stripe.
 
 .. rubric:: Footnotes
 

http://git-wip-us.apache.org/repos/asf/trafficserver/blob/0c953e12/doc/glossary.en.rst
----------------------------------------------------------------------
diff --git a/doc/glossary.en.rst b/doc/glossary.en.rst
index e103fee..94996db 100644
--- a/doc/glossary.en.rst
+++ b/doc/glossary.en.rst
@@ -44,11 +44,34 @@ Glossary
 
    cache stripe
       A homogenous persistent store for the cache in a single :term:`cache 
span`. A stripe always resides
-      entirely on a single physical device and is treated as an 
undifferentiated span of bytes.
+      entirely on a single physical device and is treated as an 
undifferentiated span of bytes. This is the smallest
+      independent unit of storage.
 
    cache span
       The physical storage described by a single line in 
:file:`storage.config`.
 
+   cache key
+      A byte sequence that is a globally unique identifier for an 
:term:`object <cache object>` in the cache. By default
+      the URL for the object is used.
+
+   cache ID
+      A 128 bit value used as a fixed sized identifier for an object in the 
cache. This is computed from the
+      :term:`cache key` using the `MD5 hashing function 
<http://www.openssl.org/docs/crypto/md5.html>`_.
+
+   cache tag
+      The bottom few bits (12 currently) of the :term:`cache ID`. This is used 
in the :ref:`cache directory <cache-directory>`
+      for a preliminary identity check before going to disk.
+
+   cache object
+      The minimal self contained unit of data in the cache. Cache objects are 
the stored version of equivalent content
+      streams from an origin server. A single object can have multiple 
variants called :term:`alternates <alternate>`.
+
+   alternate
+      A variant of a :term:`cache object`. This was originally created to 
handle the `VARY mechanism
+      <http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.44>`_ but 
has since been used for additional
+      purposes. All alternates of an object must be equivalent in some manner, 
that is they are alternate forms of the
+      same stream. The most common example is having normal and compressed 
versions of the stream.
+
    storage unit
       Obsolete term for :term:`cache span`.
 
@@ -59,3 +82,20 @@ Glossary
 
    write cursor
       The location in a :term:`cache stripe` where new data is written.
+
+   directory segment
+      A contiguous group of :term:`buckets <directory bucket>`. Each 
:term:`cache stripe` has a set of segments all of
+      which have the same number of buckets, although the number of buckets 
per segment can vary between cache stripes.
+      Segments are administrative in purpose to minimize the size of free list 
and hash bucket pointers.
+
+   directory bucket
+      A contiguous fixed sized group of :term:`directory entries <directory 
entry>`. This is used for hash bucket
+      maintenance optimization.
+
+   directory entry
+      An in memory entry that describes a :term:`cache fragment`.
+
+   cache fragment
+      The unit of storage in the cache. All reads from the cache always read 
exactly one fragment. Fragments may be
+      written in groups, but every write is always an integral number of 
fragments. Each fragment has a corresponding
+      :term:`directory entry` which describes its location in the cache 
storage.

git commit: Doc: Clean up in cache architecture, added information about stripe assignment initialization.

Reply via email to