[PATCH] docs: improve readability for people with poorer eyesight

2018-10-04 Darrick J. Wong
Hi,

So my eyesight still hasn't fully recovered, and in the meantime it's
been difficult to read the online documentation.  Here are some stylesheet
overrides I've been using to make it easier for me to read them:
https://djwong.org/docs/kdoc/index.html

---
From: Darrick J. Wong 

My eyesight is not in good shape, which means that I have difficulty
reading the online Linux documentation.  Specifically, body text is
oddly small compared to list items and the contrast of various text
elements is too low for me to be able to see easily.

Therefore, alter the HTML theme overrides to make the text larger and
increase the contrast for better visibility, and trust the typeface
choices of the reader's browser.

For the PDF output, increase the text size, use a sans-serif typeface
for sans-serif text, and use a serif typeface for "roman" serif text.

Signed-off-by: Darrick J. Wong 
---
 Documentation/conf.py                           |    6 ++--
 Documentation/sphinx-static/theme_overrides.css |   38 +++
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/conf.py b/Documentation/conf.py
index e45d4d2b5adb..03dd6dc135ea 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -259,7 +259,7 @@ latex_elements = {
 'papersize': 'a4paper',
 
 # The font size ('10pt', '11pt' or '12pt').
-'pointsize': '8pt',
+'pointsize': '11pt',
 
 # Latex figure (float) alignment
 #'figure_align': 'htbp',
@@ -272,8 +272,8 @@ latex_elements = {
 'preamble': '''
% Use some font with UTF-8 support with XeLaTeX
 \\usepackage{fontspec}
-\\setsansfont{DejaVu Serif}
-\\setromanfont{DejaVu Sans}
+\\setsansfont{DejaVu Sans}
+\\setromanfont{DejaVu Serif}
 \\setmonofont{DejaVu Sans Mono}
 
  '''
diff --git a/Documentation/sphinx-static/theme_overrides.css b/Documentation/sphinx-static/theme_overrides.css
index 522b6d4c49d4..e21e36cd6761 100644
--- a/Documentation/sphinx-static/theme_overrides.css
+++ b/Documentation/sphinx-static/theme_overrides.css
@@ -4,6 +4,44 @@
  *
  */
 
+/* Improve contrast and increase size for easier reading. */
+
+body {
+   font-family: serif;
+   color: black;
+   font-size: 100%;
+}
+
+h1, h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend {
+   font-family: sans-serif;
+}
+
+.wy-menu-vertical li.current a {
+   color: #505050;
+}
+
+.wy-menu-vertical li.on a, .wy-menu-vertical li.current > a {
+   color: #303030;
+}
+
+div[class^="highlight"] pre {
+   font-family: monospace;
+   color: black;
+   font-size: 100%;
+}
+
+.wy-menu-vertical {
+   font-family: sans-serif;
+}
+
+.c {
+   font-style: normal;
+}
+
+p {
+   font-size: 100%;
+}
+
 /* Interim: Code-blocks with line nos - lines and line numbers don't line up.
  * see: https://github.com/rtfd/sphinx_rtd_theme/issues/419
  */


[PATCH 1/2] docs: move ext4 administrative docs to admin-guide/

2018-10-04 Darrick J. Wong
From: Darrick J. Wong 

Move the ext4 mount option and other administrative stuff to the Linux
administrator's guide.

Signed-off-by: Darrick J. Wong 
---
 Documentation/admin-guide/ext4.rst       |  574 ++
 Documentation/admin-guide/index.rst      |    1 
 Documentation/conf.py                    |    2 
 Documentation/filesystems/ext4/ext4.rst  |  574 --
 Documentation/filesystems/ext4/index.rst |    1 
 5 files changed, 577 insertions(+), 575 deletions(-)
 create mode 100644 Documentation/admin-guide/ext4.rst
 delete mode 100644 Documentation/filesystems/ext4/ext4.rst


diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
new file mode 100644
index ..e506d3dae510
--- /dev/null
+++ b/Documentation/admin-guide/ext4.rst
@@ -0,0 +1,574 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+ext4 General Information
+========================
+
+Ext4 is an advanced level of the ext3 filesystem which incorporates
+scalability and reliability enhancements for supporting large filesystems
+(64 bit) in keeping with increasing disk capacities and state-of-the-art
+feature requirements.
+
+Mailing list:  linux-e...@vger.kernel.org
+Web site:  http://ext4.wiki.kernel.org
+
+
+Quick usage instructions
+========================
+
+Note: More extensive information for getting started with ext4 can be
+found at the ext4 wiki site at the URL:
+http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+
+  - The latest version of e2fsprogs can be found at:
+
+    https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+
+    or
+
+    http://sourceforge.net/project/showfiles.php?group_id=2406
+
+    or grab the latest git repository from:
+
+    https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+  - Create a new filesystem using the ext4 filesystem type:
+
+        # mke2fs -t ext4 /dev/hda1
+
+    Or to configure an existing ext3 filesystem to support extents:
+
+        # tune2fs -O extents /dev/hda1
+
+    If the filesystem was created with 128-byte inodes, it can be
+    converted to use 256-byte inodes for greater efficiency via:
+
+        # tune2fs -I 256 /dev/hda1
+
+  - Mounting:
+
+        # mount -t ext4 /dev/hda1 /wherever
+
+  - When comparing performance with other filesystems, it's always
+    important to try multiple workloads; very often a subtle change in a
+    workload parameter can completely change the ranking of which
+    filesystems do well compared to others.  When comparing versus ext3,
+    note that ext4 enables write barriers by default, while ext3 does
+    not enable write barriers by default.  So it is useful to explicitly
+    specify whether barriers are enabled or not via the
+    '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
+    for a fair comparison.  When tuning ext3 for best benchmark numbers,
+    it is often worthwhile to try changing the data journaling mode;
+    '-o data=writeback' can be faster for some workloads.  (Note, however,
+    that running mounted with data=writeback can potentially leave stale
+    data exposed in recently written files in case of an unclean shutdown,
+    which could be a security exposure in some situations.)  Configuring
+    the filesystem with a large journal can also be helpful for
+    metadata-intensive workloads.
+
+Features
+========
+
+Currently Available
+-------------------
+
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
+* extent format reduces metadata overhead (RAM, IO for access, transactions)
+* extent format more robust in face of on-disk corruption due to magics
+  and internal redundancy in tree
+* improved file allocation (multi-block alloc)
+* lift 32000 subdirectory limit imposed by i_links_count[1]
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g. for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+  flex_bg feature
+* large file support
+* inode allocation using large virtual block groups via flex_bg
+* delayed allocation
+* large block (up to pagesize) support
+* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
+  the ordering)
+
+[1] Filesystems with a block size of 1k may see a limit imposed by the
+directory hash tree having a maximum depth of two.
+
+Options
+=======
+
+When mounting an ext4 filesystem, the following options are accepted:
+(*) == default
+
+  ro
+    Mount filesystem read only. Note that ext4 will replay the journal (and
+    thus write to the partition) even when mounted "read only". The mount
+    options "ro,noload" can be used to prevent writes to the filesystem.
+
+  journal_checksum
+    Enable checksumming of t

[PATCH 0/2] ext4: even more documentation fixes

2018-10-04 Darrick J. Wong
Hi all,

This series fixes some problems that were brought up during review for
the XFS documentation which I hadn't known about when pushing the ext4
documentation during the 4.19 cycle.

The first patch moves the ext4 mount option and sysfs knob information
into the Linux administration guide.

The second patch moves the ext4 Data Structures & Algorithms book up a
level since it's a self-contained documentation book.

I've built the docs and put them here, in case you hate reading rst:
https://djwong.org/docs/kdoc/filesystems/ext4/index.html

The patchset should apply cleanly against 4.19-rc6.  Comments and
questions are, as always, welcome.

--D


[PATCH 22/22] docs: add XFS metadump structure to DS book

2018-10-03 Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/auxiliary.rst  |    2 +
 .../filesystems/xfs-data-structures/metadump.rst   |   72 
 2 files changed, 74 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/metadump.rst


diff --git a/Documentation/filesystems/xfs-data-structures/auxiliary.rst b/Documentation/filesystems/xfs-data-structures/auxiliary.rst
index d2fd2f88ad0e..7ed970f0bc28 100644
--- a/Documentation/filesystems/xfs-data-structures/auxiliary.rst
+++ b/Documentation/filesystems/xfs-data-structures/auxiliary.rst
@@ -2,3 +2,5 @@
 
 Auxiliary Data Structures
 =========================
+
+.. include:: metadump.rst
diff --git a/Documentation/filesystems/xfs-data-structures/metadump.rst b/Documentation/filesystems/xfs-data-structures/metadump.rst
new file mode 100644
index ..51bc966c1f76
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/metadump.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Metadata Dumps
+--------------
+
+The xfs\_metadump and xfs\_mdrestore tools are used to create a sparse
+snapshot of a live file system and to restore that snapshot onto a block
+device for debugging purposes. Only the metadata are captured in the snapshot,
+and the metadata blocks may be obscured for privacy reasons.
+
+A metadump file starts with an xfs\_metablock that records the addresses of
+the blocks that follow. Following that are the metadata blocks captured from
+the filesystem. The first block following the first xfs\_metablock must be
+the superblock from AG 0. If the metadump has more blocks than can be pointed
+to by the xfs\_metablock.mb\_daddr area, the sequence of xfs\_metablock
+followed by metadata blocks is repeated.
+
+**Metadata Dump Format.**
+
+.. code:: c
+
+    struct xfs_metablock {
+        __be32   mb_magic;
+        __be16   mb_count;
+        uint8_t  mb_blocklog;
+        uint8_t  mb_reserved;
+        __be64   mb_daddr[];
+    };
+
+**mb\_magic**
+    The magic number, "XFSM" (0x5846534d).
+
+**mb\_count**
+    Number of blocks indexed by this record. Because the mb\_daddr array
+    shares the metablock with the header, this value must not exceed
+    ((1 << mb\_blocklog) - sizeof(struct xfs\_metablock)) / sizeof(\_\_be64).
+
+**mb\_blocklog**
+    The log2 size of a metadump block. The size of a metadump block is 512
+    bytes, so this value should be 9.
+
+**mb\_reserved**
+    Reserved. Should be zero.
+
+**mb\_daddr**
+    An array of disk addresses. Each of the mb\_count blocks (of size
+    (1 << mb\_blocklog)) following the xfs\_metablock should be written back
+    to the address pointed to by the corresponding mb\_daddr entry.
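A minimal sketch of how a reader might decode one metablock, written only against the field descriptions above. It assumes a 512-byte metablock (mb\_blocklog = 9) and hand-decodes the big-endian fields. The struct and helper names (metablock_hdr, decode_metablock, metablock_daddr, max_daddr_entries) are hypothetical, not taken from xfsprogs.

```c
#include <stddef.h>
#include <stdint.h>

/* Decoded copy of the 8-byte xfs_metablock header (hypothetical names). */
struct metablock_hdr {
	uint32_t magic;
	uint16_t count;
	uint8_t  blocklog;
	uint8_t  reserved;
};

#define XFS_MD_MAGIC	0x5846534dU	/* "XFSM" */
#define MB_HDR_SIZE	8		/* mb_magic + mb_count + mb_blocklog + mb_reserved */

static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* How many 8-byte mb_daddr entries fit after the header in one metablock. */
static size_t max_daddr_entries(uint8_t blocklog)
{
	return (((size_t)1 << blocklog) - MB_HDR_SIZE) / sizeof(uint64_t);
}

/* Decode one metablock; returns 0 if it looks valid, -1 otherwise. */
static int decode_metablock(const uint8_t *buf, struct metablock_hdr *hdr)
{
	hdr->magic = get_be32(buf);
	hdr->count = get_be16(buf + 4);
	hdr->blocklog = buf[6];
	hdr->reserved = buf[7];

	/* This sketch only accepts the 512-byte block size described above. */
	if (hdr->magic != XFS_MD_MAGIC || hdr->blocklog != 9)
		return -1;
	if (hdr->count > max_daddr_entries(hdr->blocklog))
		return -1;
	return 0;
}

/* Disk address to which the i-th block after this metablock is written back. */
static uint64_t metablock_daddr(const uint8_t *buf, unsigned int i)
{
	return get_be64(buf + MB_HDR_SIZE + 8 * (size_t)i);
}
```

With mb\_blocklog = 9, max_daddr_entries() gives (512 - 8) / 8 = 63 entries per metablock. The i-th data block following the metablock would be written back to metablock_daddr(buf, i).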
+
+Dump Obfuscation
+~~~~~~~~~~~~~~~~
+
+Unless explicitly disabled, the xfs\_metadump tool obfuscates empty block
+space and naming information to avoid leaking sensitive information into the
+metadump file. xfs\_metadump does not copy user data blocks.
+
+The obfuscation policy is as follows:
+
+-  File and extended attribute names are both considered "names".
+
+-  Names longer than 8 characters are totally rewritten with a name that
+   matches the hash of the old name.
+
+-  Names between 5 and 8 characters are partially rewritten to match the hash
+   of the old name.
+
+-  Names shorter than 5 characters are not obscured at all.
+
+-  Names that cross a block boundary are not obscured at all.
+
+-  Extended attribute values are zeroed.
+
+-  Empty parts of metadata blocks are zeroed.
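The name rules above reduce to a decision based on name length and placement. This sketch encodes only that decision. It does not generate replacement names or compute hashes, and the enum and function names are hypothetical rather than xfs\_metadump internals.

```c
#include <stdbool.h>
#include <stddef.h>

/* Possible outcomes for one name (these enum names are hypothetical). */
enum name_obfuscation {
	NAME_KEEP,	/* name left unchanged */
	NAME_PARTIAL,	/* partially rewritten to match the old name's hash */
	NAME_FULL,	/* totally rewritten to match the old name's hash */
};

/*
 * Apply the length and block-boundary rules listed above to one file or
 * extended attribute name. crosses_block is true if the name straddles a
 * block boundary.
 */
static enum name_obfuscation classify_name(size_t namelen, bool crosses_block)
{
	if (crosses_block || namelen < 5)
		return NAME_KEEP;
	if (namelen <= 8)
		return NAME_PARTIAL;
	return NAME_FULL;
}
```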



[PATCH 20/22] docs: add XFS extended attributes structures to the DS book

2018-10-03 Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/dynamic.rst    |    1 
 .../xfs-data-structures/extended_attributes.rst    |  933 
 2 files changed, 934 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/extended_attributes.rst


diff --git a/Documentation/filesystems/xfs-data-structures/dynamic.rst b/Documentation/filesystems/xfs-data-structures/dynamic.rst
index 2c12fca905fd..16755381d0f8 100644
--- a/Documentation/filesystems/xfs-data-structures/dynamic.rst
+++ b/Documentation/filesystems/xfs-data-structures/dynamic.rst
@@ -6,3 +6,4 @@ Dynamic Allocated Structures
 .. include:: ondisk_inode.rst
 .. include:: data_extents.rst
 .. include:: directories.rst
+.. include:: extended_attributes.rst
diff --git a/Documentation/filesystems/xfs-data-structures/extended_attributes.rst b/Documentation/filesystems/xfs-data-structures/extended_attributes.rst
new file mode 100644
index ..db6de15227cd
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/extended_attributes.rst
@@ -0,0 +1,933 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Extended Attributes
+-------------------
+
+Extended attributes enable users and administrators to attach (name: value)
+pairs to inodes within the XFS filesystem. They could be used to store
+meta-information about the file.
+
+Attribute names can be up to 256 bytes in length, terminated by the first 0
+byte. The intent is that they be printable ASCII (or other character set)
+names for the attribute. The values can contain up to 64KB of arbitrary binary
+data. Some XFS internal attributes (e.g. parent pointers) use non-printable
+names for the attribute.
+
+Access Control Lists (ACLs) and Data Migration Facility (DMF) use extended
+attributes to store their associated metadata with an inode.
+
+XFS uses two disjoint attribute name spaces associated with every inode. These
+are the root and user name spaces. The root name space is accessible only to
+the superuser, and then only by specifying a flag argument to the function
+call. Other users will not see or be able to modify attributes in the root
+name space. The user name space is protected by the normal file permissions
+mechanism, so the owner of the file can decide who is able to see and/or
+modify the value of attributes on any particular file.
+
+To view extended attributes from the command line, use the getfattr command.
+To set or delete extended attributes, use the setfattr command. ACLs should
+be managed with the getfacl and setfacl commands.
+
+XFS attributes support three namespaces: "user", "trusted" (or "root" using
+IRIX terminology), and "secure".
+
+See the section about `extended attributes <#extended-attribute-versions>`__
+in the inode for instructions on how to calculate the location of the
+attributes.
+
+The following four sections describe each of the on-disk formats.
+
+Short Form Attributes
+~~~~~~~~~~~~~~~~~~~~~
+
+When all extended attributes can fit within the inode’s attribute fork, the
+inode’s di\_aformat is set to "local" and the attributes are stored in the
+inode’s literal area starting at offset di\_forkoff × 8.
+
+Shortform attributes use the following structures:
+
+.. code:: c
+
+    typedef struct xfs_attr_shortform {
+        struct xfs_attr_sf_hdr {
+            __be16     totsize;
+            __u8       count;
+        } hdr;
+        struct xfs_attr_sf_entry {
+            __uint8_t  namelen;
+            __uint8_t  valuelen;
+            __uint8_t  flags;
+            __uint8_t  nameval[1];
+        } list[1];
+    } xfs_attr_shortform_t;
+    typedef struct xfs_attr_sf_hdr xfs_attr_sf_hdr_t;
+    typedef struct xfs_attr_sf_entry xfs_attr_sf_entry_t;
+
+**totsize**
+    Total size of the attribute structure in bytes.
+
+**count**
+    The number of entries that can be found in this structure.
+
+**namelen** and **valuelen**
+    These values specify the size of the two byte arrays containing the name
+    and value pairs. valuelen is zero for extended attributes with no value.
+
+**nameval[]**
+    A single array whose size is the sum of namelen and valuelen. The names
+    and values are not null terminated on-disk. The value immediately follows
+    the name in the array.
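As a sketch of how the nameval layout is walked: each entry is three fixed bytes (namelen, valuelen, flags) followed by the name and then the value. This assumes the header occupies 4 bytes (a 2-byte big-endian totsize, a 1-byte count, and one padding byte, as the C struct alignment implies) and that totsize includes the header. walk_sf_attrs() and its callback type are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define SF_HDR_SIZE	4	/* assumed: totsize (2) + count (1) + 1 pad byte */
#define SF_ENTRY_FIXED	3	/* namelen + valuelen + flags */

/* Callback invoked once per attribute (hypothetical API). */
typedef void (*sf_attr_fn)(const uint8_t *name, uint8_t namelen,
			   const uint8_t *value, uint8_t valuelen,
			   uint8_t flags, void *priv);

/*
 * Walk a shortform attribute fork held in buf[0..buflen).
 * Returns the number of entries visited, or -1 if the data is inconsistent.
 */
static int walk_sf_attrs(const uint8_t *buf, size_t buflen,
			 sf_attr_fn fn, void *priv)
{
	size_t totsize, off = SF_HDR_SIZE;
	int count, i;

	if (buflen < SF_HDR_SIZE)
		return -1;
	totsize = ((size_t)buf[0] << 8) | buf[1];	/* big-endian on disk */
	count = buf[2];
	if (totsize > buflen)
		return -1;

	for (i = 0; i < count; i++) {
		uint8_t namelen, valuelen;

		if (off + SF_ENTRY_FIXED > totsize)
			return -1;
		namelen = buf[off];
		valuelen = buf[off + 1];
		if (off + SF_ENTRY_FIXED + namelen + valuelen > totsize)
			return -1;
		/* The value immediately follows the name; neither is NUL-terminated. */
		if (fn)
			fn(buf + off + SF_ENTRY_FIXED, namelen,
			   buf + off + SF_ENTRY_FIXED + namelen, valuelen,
			   buf[off + 2], priv);
		off += SF_ENTRY_FIXED + namelen + valuelen;
	}
	return count;
}
```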
+
+.. _attribute-flags:
+
+**flags**
+    A combination of the following:
+
+.. list-table::
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Flag
+     - Description
+
+   * - 0
+     - The attribute's namespace is "user".
+
+   * - XFS_ATTR_ROOT
+     - The attribute's namespace is "trusted".
+
+   * - XFS_ATTR_SECURE
+     - The attribute's namespace is "secure".
+
+   * - XFS_ATTR_INCOMPLETE
+     - This attribute is being modified.
+
+   * - XFS_ATTR_LOCAL
+     - The attribute value is contained within this b

[PATCH 21/22] docs: add XFS symlink structures to the DS book

2018-10-03 Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/dynamic.rst    |    1 
 .../xfs-data-structures/symbolic_links.rst         |  140 
 2 files changed, 141 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/symbolic_links.rst


diff --git a/Documentation/filesystems/xfs-data-structures/dynamic.rst b/Documentation/filesystems/xfs-data-structures/dynamic.rst
index 16755381d0f8..68837d0f477e 100644
--- a/Documentation/filesystems/xfs-data-structures/dynamic.rst
+++ b/Documentation/filesystems/xfs-data-structures/dynamic.rst
@@ -7,3 +7,4 @@ Dynamic Allocated Structures
 .. include:: data_extents.rst
 .. include:: directories.rst
 .. include:: extended_attributes.rst
+.. include:: symbolic_links.rst
diff --git a/Documentation/filesystems/xfs-data-structures/symbolic_links.rst b/Documentation/filesystems/xfs-data-structures/symbolic_links.rst
new file mode 100644
index ..9206fd44b108
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/symbolic_links.rst
@@ -0,0 +1,140 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Symbolic Links
+--------------
+
+Symbolic links to a file can be stored in one of two formats: "local" and
+"extents". The length of the symlink contents is always specified by the
+inode’s di\_size value.
+
+Short Form Symbolic Links
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Symbolic links are stored with the "local" di\_format if the symbolic link
+can fit within the inode’s data fork. The link data is an array of characters
+(di\_symlink array in the data fork union).
+
+.. figure:: images/61.png
+   :alt: Symbolic link short form layout
+
+   Symbolic link short form layout
+
+xfs\_db Short Form Symbolic Link Example
+
+
+A short symbolic link to a file is created:
+
+::
+
+    xfs_db> inode 
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0120777
+    core.version = 1
+    core.format = 1 (local)
+    ...
+    core.size = 12
+    core.nblocks = 0
+    core.extsize = 0
+    core.nextents = 0
+    ...
+    u.symlink = "small_target"
+
+Raw on-disk data with the link contents highlighted:
+
+::
+
+    xfs_db> type text
+    xfs_db> p
+    00: 49 4e a1 ff 01 01 00 01 00 00 00 00 00 00 00 00 IN..
+    10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01
+    20: 44 be e1 c7 03 c4 d4 18 44 be e1 c7 03 c4 d4 18 D...D...
+    30: 44 be e1 c7 03 c4 d4 18 00 00 00 00 00 00 00 0c D...
+    40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00
+    60: ff ff ff ff 73 6d 61 6c 6c 5f 74 61 72 67 65 74 small.target
+    70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+Extent Symbolic Links
+~~~~~~~~~~~~~~~~~~~~~
+
+If the length of the symbolic link exceeds the space available in the inode’s
+data fork, the link is moved to a new filesystem block and the inode’s
+di\_format is changed to "extents". The location of the block(s) is
+specified by the data fork’s di\_bmx[] array. In the significant majority of
+cases, this will be in one filesystem block as a symlink cannot be longer than
+1024 characters.
+
+On a v5 filesystem, the first block of each extent starts with the following
+header structure:
+
+.. code:: c
+
+    struct xfs_dsymlink_hdr {
+        __be32    sl_magic;
+        __be32    sl_offset;
+        __be32    sl_bytes;
+        __be32    sl_crc;
+        uuid_t    sl_uuid;
+        __be64    sl_owner;
+        __be64    sl_blkno;
+        __be64    sl_lsn;
+    };
+
+**sl\_magic**
+    Specifies the magic number for the symlink block: "XSLM" (0x58534c4d).
+
+**sl\_offset**
+    Offset of the symbolic link target data, in bytes.
+
+**sl\_bytes**
+    Number of bytes used to contain the link target data.
+
+**sl\_crc**
+    Checksum of the symlink block.
+
+**sl\_uuid**
+    The UUID of this block, which must match either sb\_uuid or sb\_meta\_uuid
+    depending on which features are set.
+
+**sl\_owner**
+    The inode number that this symlink block belongs to.
+
+**sl\_blkno**
+    Disk block number of this symlink.
+
+**sl\_lsn**
+    Log sequence number of the last write to this block.
+
+Filesystems formatted prior to v5 do not have this header in the remote block.
+Symlink data begins immediately at offset zero.
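A rough sketch of how many remote blocks a symlink target needs. The 56-byte header size follows from the structure above (4 × 4 + 16 + 3 × 8 bytes). As a simplification, it assumes every remote block on a v5 filesystem carries one such header, which coincides with the single-block case the text describes as typical. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define XFS_DSYMLINK_HDR_SIZE	56	/* 4 x __be32 + 16-byte uuid + 3 x __be64 */

/* Bytes of link target that fit in one remote symlink block. */
static size_t symlink_bytes_per_block(size_t blocksize, bool is_v5)
{
	return is_v5 ? blocksize - XFS_DSYMLINK_HDR_SIZE : blocksize;
}

/* Number of filesystem blocks needed to hold a remote symlink target. */
static size_t symlink_remote_blocks(size_t target_len, size_t blocksize,
				    bool is_v5)
{
	size_t per_block = symlink_bytes_per_block(blocksize, is_v5);

	/* Round up: a partially filled block still occupies a whole block. */
	return (target_len + per_block - 1) / per_block;
}
```

For a 1024-byte target on a 4096-byte-block v5 filesystem this gives one block. With 512-byte blocks it gives three on v5 (456 usable bytes each) and two on older filesystems.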
+
+.. figure:: images/62.png
+   :alt: Symbolic link extent layout
+
+   Symbolic link extent layout
+
+xfs\_db Symbolic Link Extent Example
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+A longer link is created (greater than 156 bytes):
+
+::
+
+    xfs_db> inode 
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0120777
+    core.version = 1
+   

[PATCH 17/22] docs: add XFS inode format to the DS book

2018-10-03 Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/dynamic.rst    |    2 
 .../xfs-data-structures/ondisk_inode.rst           |  558 
 2 files changed, 560 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/ondisk_inode.rst


diff --git a/Documentation/filesystems/xfs-data-structures/dynamic.rst b/Documentation/filesystems/xfs-data-structures/dynamic.rst
index 895c94e95889..945b07be2034 100644
--- a/Documentation/filesystems/xfs-data-structures/dynamic.rst
+++ b/Documentation/filesystems/xfs-data-structures/dynamic.rst
@@ -2,3 +2,5 @@
 
 Dynamic Allocated Structures
 ============================
+
+.. include:: ondisk_inode.rst
diff --git a/Documentation/filesystems/xfs-data-structures/ondisk_inode.rst b/Documentation/filesystems/xfs-data-structures/ondisk_inode.rst
new file mode 100644
index ..77ecd2917489
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/ondisk_inode.rst
@@ -0,0 +1,558 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+On-Disk Inode
+-------------
+
+All files, directories, and links are stored on disk with inodes and descend
+from the root inode with its number defined in the
+`superblock <#superblocks>`__. The previous section on `AG Inode
+Management <#ag-inode-management>`__ describes the allocation and management
+of inodes on disk. This section describes the contents of inodes themselves.
+
+An inode is divided into 3 parts:
+
+.. figure:: images/23.png
+   :alt: On-disk inode sections
+
+   On-disk inode sections
+
+-  The core contains what the inode represents, stat data, and information
+   describing the data and attribute forks.
+
+-  The di\_u "data fork" contains normal data related to the inode. Its
+   contents depend on the file type specified by di\_core.di\_mode (e.g.
+   regular file, directory, link, etc.) and how much information is contained
+   in the file, which is determined by di\_core.di\_format. The union that
+   represents this data is declared as follows:
+
+.. code:: c
+
+    union {
+        xfs_bmdr_block_t   di_bmbt;
+        xfs_bmbt_rec_t     di_bmx[1];
+        xfs_dir2_sf_t      di_dir2sf;
+        char               di_c[1];
+        xfs_dev_t          di_dev;
+        uuid_t             di_muuid;
+        char               di_symlink[1];
+    } di_u;
+
+-  The di\_a "attribute fork" contains extended attributes. Its layout is
+   determined by the di\_core.di\_aformat value. Its representation is
+   declared as follows:
+
+.. code:: c
+
+    union {
+        xfs_bmdr_block_t      di_abmbt;
+        xfs_bmbt_rec_t        di_abmx[1];
+        xfs_attr_shortform_t  di_attrsf;
+    } di_a;
+
+-  The above two unions are rarely used in the XFS code, but the structures
+   within the union are directly cast depending on the di\_mode/di\_format
+   and di\_aformat values. They are referenced in this document to make it
+   easier to explain the various structures in use within the inode.
+
+The remaining space in the inode after di\_next\_unlinked where the two forks
+are located is called the inode’s "literal area". This starts at offset
+100 (0x64) in a version 1 or 2 inode, and offset 176 (0xb0) in a version 3
+inode.
+
+The space for each of the two forks in the literal area is determined by the
+inode size and di\_core.di\_forkoff. The data fork is located between the
+start of the literal area and di\_forkoff. The attribute fork is located
+between di\_forkoff and the end of the inode.
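The split described above can be written down directly. This sketch uses the literal-area offsets given in the text (100 bytes for version 1 and 2 inodes, 176 for version 3) and treats di\_forkoff as a count of 8-byte units, as the shortform attribute section states. It also assumes di\_forkoff == 0 means there is no attribute fork. inode_fork_sizes() is a hypothetical helper, not an XFS function.

```c
#include <stddef.h>
#include <stdint.h>

#define LITERAL_OFF_V1V2	100	/* 0x64: version 1 and 2 inodes */
#define LITERAL_OFF_V3		176	/* 0xb0: version 3 inodes */

struct fork_sizes {
	size_t data_bytes;	/* bytes available to the data fork */
	size_t attr_bytes;	/* bytes available to the attribute fork */
};

static struct fork_sizes inode_fork_sizes(size_t inode_size, int di_version,
					  uint8_t di_forkoff)
{
	size_t lit_off = di_version >= 3 ? LITERAL_OFF_V3 : LITERAL_OFF_V1V2;
	size_t lit_size = inode_size > lit_off ? inode_size - lit_off : 0;
	size_t split = (size_t)di_forkoff * 8;	/* di_forkoff counts 8-byte units */
	struct fork_sizes fs = { 0, 0 };

	if (di_forkoff == 0) {
		/* No attribute fork: the data fork gets the whole literal area. */
		fs.data_bytes = lit_size;
	} else if (split < lit_size) {
		fs.data_bytes = split;
		fs.attr_bytes = lit_size - split;
	}
	/* Otherwise di_forkoff is out of range; report no space for either fork. */
	return fs;
}
```

For a 256-byte version 2 inode with no attribute fork, this yields 156 bytes of data fork. That matches the threshold in the symlink extent example.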
+
+Inode Core
+~~~~~~~~~~
+
+The inode’s core is 96 bytes on a V4 filesystem and 176 bytes on a V5
+filesystem. It contains information about the file itself, including most
+stat data, and information about the data and attribute forks that follow the
+core within the inode. It uses the following structure:
+
+.. code:: c
+
+    struct xfs_dinode_core {
+        __uint16_t        di_magic;
+        __uint16_t        di_mode;
+        __int8_t          di_version;
+        __int8_t          di_format;
+        __uint16_t        di_onlink;
+        __uint32_t        di_uid;
+        __uint32_t        di_gid;
+        __uint32_t        di_nlink;
+        __uint16_t        di_projid;
+        __uint16_t        di_projid_hi;
+        __uint8_t         di_pad[6];
+        __uint16_t        di_flushiter;
+        xfs_timestamp_t   di_atime;
+        xfs_timestamp_t   di_mtime;
+        xfs_timestamp_t   di_ctime;
+        xfs_fsize_t       di_size;
+        xfs_rfsblock_t    di_nblocks;
+        xfs_extlen_t      di_extsize;
+        xfs_extnum_t      di_nextents;
+        xfs_aextnum_t     di_anextents;
+        __uint8_t         di_forkoff;
+        __int8_t          di_aformat;
+        __uint32_t

[PATCH 16/22] docs: add preliminary XFS realtime rmapbt structures to the DS book

2018-10-03 Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/internal_inodes.rst        |    2 
 .../filesystems/xfs-data-structures/rtrmapbt.rst   |  230 
 2 files changed, 232 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/rtrmapbt.rst


diff --git a/Documentation/filesystems/xfs-data-structures/internal_inodes.rst b/Documentation/filesystems/xfs-data-structures/internal_inodes.rst
index 4c3a1bf1f822..0faf58caf8f6 100644
--- a/Documentation/filesystems/xfs-data-structures/internal_inodes.rst
+++ b/Documentation/filesystems/xfs-data-structures/internal_inodes.rst
@@ -206,3 +206,5 @@ rtbitmap location, and positive if there are any.
 This data structure is not particularly space efficient, however it is a very
 fast way to provide the same data as the two free space B+trees for regular
 files since the space is preallocated and metadata maintenance is minimal.
+
+.. include:: rtrmapbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/rtrmapbt.rst b/Documentation/filesystems/xfs-data-structures/rtrmapbt.rst
new file mode 100644
index ..1573ec4f09ec
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/rtrmapbt.rst
@@ -0,0 +1,230 @@
+Real-Time Reverse-Mapping B+tree
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Note**
+
+This data structure is under construction! Details may change.
+
+If the reverse-mapping B+tree and real-time storage device features are
+enabled, the real-time device has its own reverse block-mapping B+tree.
+
+As mentioned in the chapter about `reconstruction <#metadata-reconstruction>`__, this
+data structure is another piece of the puzzle necessary to reconstruct the
+data or attribute fork of a file from reverse-mapping records; we can also use
+it to double-check allocations to ensure that we are not accidentally
+cross-linking blocks, which can cause severe damage to the filesystem.
+
+This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_RMAPBT feature
+is enabled and a real-time device is present. The feature requires a version 5
+filesystem.
+
+The real-time reverse mapping B+tree is rooted in an inode’s data fork; the
+inode number is given by the sb\_rrmapino field in the superblock. The B+tree
+blocks themselves are stored in the regular filesystem. The structures used
+for an inode’s B+tree root are:
+
+.. code:: c
+
+    struct xfs_rtrmap_root {
+        __be16 bb_level;
+        __be16 bb_numrecs;
+    };
+
+-  On disk, the B+tree node starts with the xfs\_rtrmap\_root header followed
+   by an array of xfs\_rtrmap\_key values and then an array of
+   xfs\_rtrmap\_ptr\_t values. The size of both arrays is specified by the
+   header’s bb\_numrecs value.
+
+-  The root node in the inode can only contain up to 10 key/pointer pairs for
+   a standard 512-byte inode before a new level of nodes is added between the
+   root and the leaves. di\_forkoff should always be zero, because there are
+   no extended attributes.
+
+Each record in the real-time reverse-mapping B+tree has the following
+structure:
+
+.. code:: c
+
+    struct xfs_rtrmap_rec {
+        __be64 rm_startblock;
+        __be64 rm_blockcount;
+        __be64 rm_owner;
+        __be64 rm_fork:1;
+        __be64 rm_bmbt:1;
+        __be64 rm_unwritten:1;
+        __be64 rm_unused:7;
+        __be64 rm_offset:54;
+    };
+
+**rm\_startblock**
+    Real-time device block number of this record.
+
+**rm\_blockcount**
+    The length of this extent, in real-time blocks.
+
+**rm\_owner**
+    A 64-bit number describing the owner of this extent. This must be an inode
+    number, because the real-time device is for file data only.
+
+**rm\_fork**
+    If rm\_owner describes an inode, this can be 1 if this record is for an
+    attribute fork. This value will always be zero for real-time extents.
+
+**rm\_bmbt**
+    If rm\_owner describes an inode, this can be 1 to signify that this record
+    is for a block map B+tree block. In this case, rm\_offset has no meaning.
+    This value will always be zero for real-time extents.
+
+**rm\_unwritten**
+    A flag indicating that the extent is unwritten. This corresponds to the
+    flag in the `extent record <#data-extents>`__ format which means
+    XFS\_EXT\_UNWRITTEN.
+
+**rm\_offset**
+    The 54-bit logical file block offset, if rm\_owner describes an inode.
+
+**Note**
+
+The single-bit flag values rm\_unwritten, rm\_fork, and rm\_bmbt are
+packed into the larger fields in the C structure definition.
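Because the flags share a 64-bit word with rm\_offset, packing and unpacking that word is mostly masking. This sketch assumes the bit positions implied by the bitfield order in struct xfs\_rtrmap\_rec above: rm\_fork in bit 63, rm\_bmbt in bit 62, rm\_unwritten in bit 61, seven unused bits, and rm\_offset in bits 0-53. The macro and function names are hypothetical. The packed value is in CPU byte order and would still need converting to big-endian before it is written to disk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions implied by the bitfield order above (hypothetical names). */
#define RMAP_OFF_ATTR_FORK	(UINT64_C(1) << 63)		/* rm_fork */
#define RMAP_OFF_BMBT_BLOCK	(UINT64_C(1) << 62)		/* rm_bmbt */
#define RMAP_OFF_UNWRITTEN	(UINT64_C(1) << 61)		/* rm_unwritten */
#define RMAP_OFF_MASK		((UINT64_C(1) << 54) - 1)	/* rm_offset */

struct rmap_offset {
	uint64_t offset;	/* 54-bit logical file block offset */
	bool attr_fork;
	bool bmbt_block;
	bool unwritten;
};

/* Pack the three flags and the 54-bit offset into one 64-bit value. */
static uint64_t rmap_pack_offset(const struct rmap_offset *r)
{
	uint64_t v = r->offset & RMAP_OFF_MASK;

	if (r->attr_fork)
		v |= RMAP_OFF_ATTR_FORK;
	if (r->bmbt_block)
		v |= RMAP_OFF_BMBT_BLOCK;
	if (r->unwritten)
		v |= RMAP_OFF_UNWRITTEN;
	return v;
}

/* Split a packed value back into the flags and the offset. */
static struct rmap_offset rmap_unpack_offset(uint64_t v)
{
	struct rmap_offset r = {
		.offset = v & RMAP_OFF_MASK,
		.attr_fork = (v & RMAP_OFF_ATTR_FORK) != 0,
		.bmbt_block = (v & RMAP_OFF_BMBT_BLOCK) != 0,
		.unwritten = (v & RMAP_OFF_UNWRITTEN) != 0,
	};
	return r;
}
```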
+
+The key has the following structure:
+
+.. code:: c
+
+    struct xfs_rtrmap_key {
+        __be64 rm_startblock;
+        __be64 rm_owner;
+        __be64

[PATCH 19/22] docs: add XFS directory structure to the DS book

2018-10-03 Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/directories.rst            | 1688 
 .../filesystems/xfs-data-structures/dynamic.rst    |    1 
 2 files changed, 1689 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/directories.rst


diff --git a/Documentation/filesystems/xfs-data-structures/directories.rst b/Documentation/filesystems/xfs-data-structures/directories.rst
new file mode 100644
index ..34f34c7aedea
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/directories.rst
@@ -0,0 +1,1688 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Directories
+-----------
+
+**Note**
+
+Only v2 directories are covered here; v1 directories are obsolete.
+
+**Note**
+
+The term "block" in this section will refer to directory blocks, not
+filesystem blocks unless otherwise specified.
+
+The size of a "directory block" is defined by the
+`superblock’s <#superblocks>`__ sb\_dirblklog value. The size in bytes =
+sb\_blocksize × 2\ :sup:`sb\_dirblklog`. For example, if sb\_blocksize = 4096
+and sb\_dirblklog = 2, the directory block size is 16384 bytes. Directory
+blocks are always allocated in multiples based on sb\_dirblklog. Directory
+blocks cannot be more that 65536 bytes in size.
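The block-size formula above, expressed as a small helper. dir_block_bytes() is a hypothetical name. It treats any result over the stated 65536-byte limit as invalid and returns 0 in that case.

```c
#include <stdint.h>

#define DIR_BLOCK_MAX	65536	/* upper bound stated above */

/*
 * Directory block size in bytes: sb_blocksize * 2^sb_dirblklog.
 * Returns 0 if the result would exceed the 64 KiB limit.
 */
static uint32_t dir_block_bytes(uint32_t sb_blocksize, uint8_t sb_dirblklog)
{
	uint64_t size;

	if (sb_dirblklog > 16)		/* guard the shift; always too large */
		return 0;
	size = (uint64_t)sb_blocksize << sb_dirblklog;
	return size > DIR_BLOCK_MAX ? 0 : (uint32_t)size;
}
```

With sb\_blocksize = 4096 and sb\_dirblklog = 2, it returns 16384, matching the example above.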
+
+All directory entries contain the following "data":
+
+-  The entry’s name (a counted string consisting of a single-byte namelen
+   followed by the name, an array of 8-bit chars without a NULL terminator).
+
+-  The entry’s absolute `inode number <#inode-numbers>`__, which is always 64
+   bits (8 bytes) in size except in a special case for shortform directories.
+
+-  An offset or tag used for iterative readdir calls.
+
+-  If the XFS\_SB\_FEAT\_INCOMPAT\_FTYPE feature flag is set, each directory
+   entry contains an ftype field that caches the inode’s type to avoid having
+   to perform an inode lookup.
+
+.. list-table:: ftype values
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Flag
+     - Description
+
+   * - XFS_DIR3_FT_UNKNOWN
+     - Entry points to an unknown inode type.  This should never appear on
+       disk.
+
+   * - XFS_DIR3_FT_REG_FILE
+     - Entry points to a file.
+
+   * - XFS_DIR3_FT_DIR
+     - Entry points to another directory.
+
+   * - XFS_DIR3_FT_CHRDEV
+     - Entry points to a character device.
+
+   * - XFS_DIR3_FT_BLKDEV
+     - Entry points to a block device.
+
+   * - XFS_DIR3_FT_FIFO
+     - Entry points to a FIFO.
+
+   * - XFS_DIR3_FT_SOCK
+     - Entry points to a socket.
+
+   * - XFS_DIR3_FT_SYMLINK
+     - Entry points to a symbolic link.
+
+   * - XFS_DIR3_FT_WHT
+     - Entry points to an overlayfs whiteout file.  This (as far as the author
+       knows) has never appeared on disk.
+
+All non-shortform directories also contain two additional structures:
+"leaves" and "freespace indexes".
+
+-  Leaves contain the sorted hashed name value (xfs\_da\_hashname() in
+   xfs\_da\_btree.c) and associated "address" which points to the
+   effective offset into the directory’s data structures. Leaves are used to
+   optimise lookup operations.
+
+-  Freespace indexes contain free space/empty entry tracking for quickly
+   finding an appropriately sized location for new entries. They maintain the
+   largest free space for each "data" block.
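Because leaf entries are sorted by hash value and distinct names can share a hash, a lookup typically starts at the first entry whose hash is not less than the target and then compares names. A minimal sketch of that search, assuming a hypothetical host-order array of (hashval, address) pairs; it is an illustration, not the kernel's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-order copy of one leaf entry. */
struct leaf_ent {
    uint32_t hashval;   /* xfs_da_hashname() of the entry's name */
    uint32_t address;   /* where the entry lives in the data blocks */
};

/*
 * Return the index of the first entry with ents[i].hashval >= hash
 * (a "lower bound" binary search).  Returns n if every hash is smaller.
 */
static size_t leaf_lower_bound(const struct leaf_ent *ents, size_t n,
                               uint32_t hash)
{
    size_t lo = 0, hi = n;

    assert(ents != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ents[mid].hashval < hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

A caller would then step forward from the returned index while the hash still matches, checking each candidate's actual name.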
+
+A few common types are used for the directory structures:
+
+.. code:: c
+
+    typedef __uint16_t xfs_dir2_data_off_t;
+    typedef __uint32_t xfs_dir2_dataptr_t;
+
+Short Form Directories
+~~~~~~~~~~~~~~~~~~~~~~
+
+-  Directory entries are stored within the inode.
+
+-  The only data stored is the name, inode number, and offset. No "leaf" or
+   "freespace index" information is required as an inode can only store a
+   few entries.
+
+-  "." is not stored (as it’s in the inode itself), and ".." is a
+   dedicated parent field in the header.
+
+-  The number of directory entries that can be stored in an inode depends on the
+   `inode <#on-disk-inode>`__ size, the number of entries, the length of the
+   entry names, and extended attribute data.
+
+-  Once the number of entries exceeds the space available in the inode, the
+   format is converted to a `block directory <#block-directories>`__.
+
+-  Shortform directory data is packed as tightly as possible on the disk with
+   the remaining space zeroed:
+
+.. code:: c
+
+    typedef struct xfs_dir2_sf {
+        xfs_dir2_sf_hdr_t       hdr;
+        xfs_dir2_sf_entry_t     list[1];
+    } xfs_dir2_sf_t;
+
+**hdr**
+Short form directory header.
+
+**list**
+An array of variable-length directory entry records.
+
+.. code:: c
+
+    typedef struct xfs_dir2_sf_hdr {
+        __uint8_t               count;
+        __uint8_t               i8count;
+        xfs_dir2_inou_t         parent;
+    } xfs_dir2_sf_hdr_t;
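The header and entries are variable sized. As a sketch, assuming each shortform entry is laid out as a 1-byte namelen, a 2-byte offset, the name bytes, an optional 1-byte ftype, and a 4-byte inode number (8 bytes when i8count is nonzero), with the header's parent field shrinking to 4 bytes in the same way, the space used can be computed as follows. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Header: count (1) + i8count (1) + parent inode (4 or 8 bytes). */
static size_t sf_hdr_size(unsigned int i8count)
{
    return 1 + 1 + (i8count ? 8 : 4);
}

/* One entry: namelen (1) + offset (2) + name + [ftype (1)] + inumber. */
static size_t sf_entry_size(size_t namelen, bool has_ftype,
                            unsigned int i8count)
{
    assert(namelen > 0 && namelen <= 255);   /* namelen is a single byte */
    return 1 + 2 + namelen + (has_ftype ? 1 : 0) + (i8count ? 8 : 4);
}
```

For example, two 5-byte names on an ftype-enabled filesystem with 32-bit inode numbers need 6 + 2 × 13 = 32 bytes of inode fork space under these assumptions.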

[PATCH 14/22] docs: add XFS log to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/globals.rst|1 
 .../xfs-data-structures/journaling_log.rst | 1442 
 2 files changed, 1443 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/journaling_log.rst


diff --git a/Documentation/filesystems/xfs-data-structures/globals.rst b/Documentation/filesystems/xfs-data-structures/globals.rst
index c91b1d24d6e7..8ce83deafae5 100644
--- a/Documentation/filesystems/xfs-data-structures/globals.rst
+++ b/Documentation/filesystems/xfs-data-structures/globals.rst
@@ -6,3 +6,4 @@ Global Structures
 .. include:: btrees.rst
 .. include:: dabtrees.rst
 .. include:: allocation_groups.rst
+.. include:: journaling_log.rst
diff --git a/Documentation/filesystems/xfs-data-structures/journaling_log.rst b/Documentation/filesystems/xfs-data-structures/journaling_log.rst
new file mode 100644
index ..78d8fa1933ae
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/journaling_log.rst
@@ -0,0 +1,1442 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Journaling Log
+--------------
+
+**Note**
+
+Only the v2 log format is covered here.
+
+The XFS journal exists on disk as a reserved extent of blocks within the
+filesystem, or as a separate journal device. The journal itself can be thought
+of as a series of log records; each log record contains part of a transaction
+or a whole transaction. A transaction consists of a series of log operation headers
+("log items"), formatting structures, and raw data. The first operation in
+a transaction establishes the transaction ID and the last operation is a
+commit record. The operations recorded between the start and commit operations
+represent the metadata changes made by the transaction. If the commit
+operation is missing, the transaction is incomplete and cannot be recovered.
+
+Log Records
+~~~~~~~~~~~
+
+The XFS log is split into a series of log records. Each log record
+corresponds to an in-core log buffer, which can be up to 256KiB in size. Each
+record has a log sequence number (LSN), which is the same LSN recorded in the
+v5 metadata integrity fields.
+
+Log sequence numbers are a 64-bit quantity consisting of two 32-bit
+quantities. The upper 32 bits are the "cycle number", which increments every
+time XFS cycles through the log. The lower 32 bits are the "block number",
+which is assigned when a transaction is committed, and should correspond to
+the block offset within the log.
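Splitting an LSN into its two halves is plain bit arithmetic. A sketch with hypothetical helper names; to our knowledge the kernel uses similar CYCLE_LSN and BLOCK_LSN macros for the same purpose.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits: how many times XFS has cycled through the log. */
static uint32_t lsn_cycle(uint64_t lsn)
{
    return (uint32_t)(lsn >> 32);
}

/* Lower 32 bits: block offset of the record within the log. */
static uint32_t lsn_block(uint64_t lsn)
{
    return (uint32_t)lsn;
}

/* Build an LSN from its parts and check that it round-trips. */
static uint64_t make_lsn(uint32_t cycle, uint32_t block)
{
    uint64_t lsn = ((uint64_t)cycle << 32) | block;

    assert(lsn_cycle(lsn) == cycle && lsn_block(lsn) == block);
    return lsn;
}
```

Because the cycle occupies the high half, comparing two LSNs as unsigned 64-bit integers orders them by cycle first and by block second.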
+
+A log record begins with the following header, which occupies 512 bytes on
+disk:
+
+.. code:: c
+
+    typedef struct xlog_rec_header {
+        __be32    h_magicno;
+        __be32    h_cycle;
+        __be32    h_version;
+        __be32    h_len;
+        __be64    h_lsn;
+        __be64    h_tail_lsn;
+        __le32    h_crc;
+        __be32    h_prev_block;
+        __be32    h_num_logops;
+        __be32    h_cycle_data[XLOG_HEADER_CYCLE_SIZE / BBSIZE];
+        /* new fields */
+        __be32    h_fmt;
+        uuid_t    h_fs_uuid;
+        __be32    h_size;
+    } xlog_rec_header_t;
+
+**h\_magicno**
+The magic number of log records, 0xfeedbabe.
+
+**h\_cycle**
+Cycle number of this log record.
+
+**h\_version**
+Log record version, currently 2.
+
+**h\_len**
+Length of the log record, in bytes. Must be aligned to a 64-bit boundary.
+
+**h\_lsn**
+Log sequence number of this record.
+
+**h\_tail\_lsn**
+Log sequence number of the first log record with uncommitted buffers.
+
+**h\_crc**
+Checksum of the log record header, the cycle data, and the log records
+themselves.
+
+**h\_prev\_block**
+Block number of the previous log record.
+
+**h\_num\_logops**
+The number of log operations in this record.
+
+**h\_cycle\_data**
+The first u32 of each log sector must contain the cycle number. Since log
+item buffers are formatted without regard to this requirement, the
+original contents of the first four bytes of each sector in the log are
+copied into the corresponding element of this array. After that, the first
+four bytes of those sectors are stamped with the cycle number. This
+process is reversed at recovery time. If there are more sectors in this
+log record than there are slots in this array, the cycle data continues
+in as many additional sectors as are needed; each of those sectors is
+formatted as an xlog\_rec\_ext\_header.
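The stamping scheme can be sketched as follows, assuming 512-byte sectors, a record short enough that no extension headers are needed, and a big-endian cycle stamp. The function names are hypothetical, and the sketch ignores the checksum and every other header field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512   /* assumed basic block size */

/* Store v big-endian at p. */
static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Before writing: save each sector's first four bytes, then stamp the cycle. */
static void pack_cycle_data(unsigned char *data, size_t nsectors,
                            uint32_t cycle, uint32_t *cycle_data)
{
    assert(data != NULL && cycle_data != NULL);
    for (size_t i = 0; i < nsectors; i++) {
        unsigned char *sector = data + i * SECTOR_SIZE;

        memcpy(&cycle_data[i], sector, 4);  /* raw bytes, no byte swapping */
        put_be32(sector, cycle);
    }
}

/* At recovery: copy the saved bytes back over the cycle stamps. */
static void unpack_cycle_data(unsigned char *data, size_t nsectors,
                              const uint32_t *cycle_data)
{
    for (size_t i = 0; i < nsectors; i++)
        memcpy(data + i * SECTOR_SIZE, &cycle_data[i], 4);
}
```

Saving the displaced words as raw bytes means the restore step needs no knowledge of what the log item buffers contained.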
+
+**h\_fmt**
+Format of the log record. This is one of the following values:
+
+.. list-table::
+   :widths: 24 56
+   :header-rows: 1
+
+   * - Format value
+     - Log format
+
+   * - XLOG\_FMT\_UNKNOWN
+     - Unknown. Perhaps this log is corrupt.
+
+   * - XLOG\_FMT\_LINUX\_LE
+     - Little-endian Linux.
+
+   * - XLOG\_FMT\_LINUX\_BE
+     - Big-endian Linux.
+
+   * - XLOG\_FM

[PATCH 18/22] docs: add XFS data extent map doc to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/data_extents.rst   |  337 
 .../filesystems/xfs-data-structures/dynamic.rst|1 
 2 files changed, 338 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/data_extents.rst


diff --git a/Documentation/filesystems/xfs-data-structures/data_extents.rst b/Documentation/filesystems/xfs-data-structures/data_extents.rst
new file mode 100644
index ..a410397e9892
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/data_extents.rst
@@ -0,0 +1,337 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Data Extents
+------------
+
+XFS manages space using extents, which are defined as a starting location and
+length. A fork in an XFS inode maps a logical offset to a space extent. This
+enables a file’s extent map to support sparse files (i.e. "holes" in the
+file). A flag is also used to specify if the extent has been preallocated but
+has not yet been written (unwritten extent).
+
+A file can have more than one extent if one chunk of contiguous disk space is
+not available for the file. As a file grows, the XFS space allocator will
+attempt to keep space contiguous and to merge extents. If more than one file
+is being allocated space in the same AG at the same time, multiple extents for
+the files will occur as the extent allocations interleave. The effect of this
+can vary depending on the extent allocator used in the XFS driver.
+
+An extent is 128 bits in size and uses the following packed layout:
+
+.. figure:: images/31.png
+   :alt: Extent record format
+
+   Extent record format
+
+The extent is represented by the xfs\_bmbt\_rec structure which uses a big
+endian format on disk. In-core management of extents uses the xfs\_bmbt\_irec
+structure which is the unpacked version of xfs\_bmbt\_rec:
+
+.. code:: c
+
+    struct xfs_bmbt_irec {
+        xfs_fileoff_t   br_startoff;
+        xfs_fsblock_t   br_startblock;
+        xfs_filblks_t   br_blockcount;
+        xfs_exntst_t    br_state;
+    };
+
+**br\_startoff**
+Logical block offset of this mapping.
+
+**br\_startblock**
+Filesystem block of this mapping.
+
+**br\_blockcount**
+The length of this mapping, in filesystem blocks.
+
+**br\_state**
+The extent br\_state field uses the following enum declaration:
+
+.. code:: c
+
+    typedef enum {
+        XFS_EXT_NORM,
+        XFS_EXT_UNWRITTEN,
+        XFS_EXT_INVALID
+    } xfs_exntst_t;
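The packed 128-bit on-disk record can be decoded with shifts and masks. A sketch, assuming the two 64-bit halves have already been converted from big-endian to host order, and assuming the bit layout we understand the kernel to use: bit 127 is the unwritten flag, bits 73 to 126 hold the 54-bit startoff, bits 21 to 72 the 52-bit startblock, and bits 0 to 20 the 21-bit blockcount. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum ext_state { EXT_NORM, EXT_UNWRITTEN };

/* Host-order, unpacked view of one extent record. */
struct extent {
    uint64_t startoff;    /* 54 bits: logical file block */
    uint64_t startblock;  /* 52 bits: filesystem block */
    uint64_t blockcount;  /* 21 bits: length in blocks */
    enum ext_state state;
};

#define MASK(n) ((UINT64_C(1) << (n)) - 1)

/* l0 holds bits 64-127 of the record, l1 holds bits 0-63. */
static void extent_unpack(uint64_t l0, uint64_t l1, struct extent *e)
{
    e->state      = (l0 >> 63) ? EXT_UNWRITTEN : EXT_NORM;
    e->startoff   = (l0 & MASK(63)) >> 9;
    e->startblock = ((l0 & MASK(9)) << 43) | (l1 >> 21);
    e->blockcount = l1 & MASK(21);
}

static void extent_pack(const struct extent *e, uint64_t *l0, uint64_t *l1)
{
    assert(e->startoff <= MASK(54) && e->startblock <= MASK(52) &&
           e->blockcount <= MASK(21));
    *l0 = ((uint64_t)(e->state == EXT_UNWRITTEN) << 63) |
          (e->startoff << 9) | (e->startblock >> 43);
    /* Only the low 43 bits of startblock land in l1. */
    *l1 = (e->startblock << 21) | e->blockcount;
}
```

The 21-bit blockcount is where the per-extent limit comes from: at most 2,097,151 blocks, or just under 8GiB with 4KB blocks.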
+
+Some other points about extents:
+
+-  The xfs\_bmbt\_rec\_32\_t and xfs\_bmbt\_rec\_64\_t structures were
+   effectively the same as xfs\_bmbt\_rec\_t, just different representations
+   of the same 128 bits in on-disk big endian format. xfs\_bmbt\_rec\_32\_t
+   was removed and xfs\_bmbt\_rec\_64\_t renamed to xfs\_bmbt\_rec\_t some
+   time ago.
+
+-  When a file is created and written to, XFS will endeavour to keep the
+   extents within the same AG as the inode. It may use a different AG if the
+   AG is busy or there is no space left in it.
+
+-  If a file is zero bytes long, it will have no extents and di\_nblocks and
+   di\_nextents will be zero. Any file with data will have at least one extent,
+   and each extent can use from 1 to 2\ :sup:`21` - 1 (just over 2 million)
+   blocks on the filesystem. For a default 4KB block size filesystem, a single
+   extent can therefore be just under 8GiB in length.
+
+The following two subsections cover the two methods of storing extent
+information for a file. The first, which is the fastest and simplest, stores
+the file’s entire extent array inside the inode. The second uses a slower and
+more complex B+tree, which can handle thousands to millions of extents
+efficiently.
+
+Extent List
+~~~~~~~~~~~
+
+If the entire extent list is short enough to fit within the inode’s fork
+region, we say that the fork is in "extent list" format. This format is
+optimal in terms of speed and resource consumption. The trade-off is that the
+file can only have a few extents before the inode runs out of space.
+
+The data fork of the inode contains an array of extents; the size of the array
+is determined by the inode’s di\_nextents value.
+
+.. figure:: images/32.png
+   :alt: Inode data fork extent layout
+
+   Inode data fork extent layout
+
+The number of extents that can fit in the inode depends on the inode size and
+di\_forkoff. For a default 256 byte inode with no extended attributes, a file
+can have up to 9 extents with this format. On a default v5 filesystem with 512
+byte inodes, a file can have up to 21 extents with this format. Beyond that,
+extents have to use the B+tree format.
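The capacities above follow from simple division, since each packed extent record occupies 16 bytes. A sketch, assuming an inode core of 100 bytes for pre-v5 inodes and 176 bytes for v5 inodes (our reading of the on-disk inode layout, not stated in this section); the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EXTENT_REC_SIZE 16   /* one packed 128-bit extent record */

/*
 * Hypothetical helper: how many extent records fit in the data fork.
 * core_size is the size of the fixed inode core (assumed 100 bytes for
 * v1/v2 inodes and 176 bytes for v3 inodes); attr_bytes is the space
 * given to the attribute fork (0 when there are no extended attributes).
 */
static size_t max_inline_extents(size_t inode_size, size_t core_size,
                                 size_t attr_bytes)
{
    assert(inode_size >= core_size + attr_bytes);
    return (inode_size - core_size - attr_bytes) / EXTENT_REC_SIZE;
}
```

Under these assumptions, (256 - 100) / 16 = 9 and (512 - 176) / 16 = 21, matching the figures quoted above.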
+
+xfs\_db Inode Data Fork Extents Example
+^^^
+
+An 8MB file with one extent:
+
+::
+
+    xfs_db> inode
+    xfs_db> p
+    core.magic = 0x494e
+    core.mode = 0100644
+    core.version = 1
+    core.format = 2 (extents)
+    ...
+    core.size = 829440

[PATCH 15/22] docs: add XFS internal inodes to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/globals.rst|1 
 .../xfs-data-structures/internal_inodes.rst|  208 
 2 files changed, 209 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/internal_inodes.rst


diff --git a/Documentation/filesystems/xfs-data-structures/globals.rst b/Documentation/filesystems/xfs-data-structures/globals.rst
index 8ce83deafae5..1662540e40ef 100644
--- a/Documentation/filesystems/xfs-data-structures/globals.rst
+++ b/Documentation/filesystems/xfs-data-structures/globals.rst
@@ -7,3 +7,4 @@ Global Structures
 .. include:: dabtrees.rst
 .. include:: allocation_groups.rst
 .. include:: journaling_log.rst
+.. include:: internal_inodes.rst
diff --git a/Documentation/filesystems/xfs-data-structures/internal_inodes.rst b/Documentation/filesystems/xfs-data-structures/internal_inodes.rst
new file mode 100644
index ..4c3a1bf1f822
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/internal_inodes.rst
@@ -0,0 +1,208 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Internal Inodes
+---------------
+
+XFS allocates several inodes when a filesystem is created. These inodes are
+internal: they are not reachable through the standard directory structure,
+only through the superblock.
+
+Quota Inodes
+~~~~~~~~~~~~
+
+Prior to version 5 filesystems, two inodes can be allocated for quota
+management. The first inode will be used for user quotas. The second inode
+will be used for group quotas or project quotas, depending on mount options.
+Group and project quotas are mutually exclusive features in these
+environments.
+
+In version 5 or later filesystems, each quota type is allocated its own inode,
+making it possible to use group and project quota management simultaneously.
+
+-  Project quota’s primary purpose is to track and monitor disk usage for
+   directories. For this to occur, the directory inode must have the
+   XFS\_DIFLAG\_PROJINHERIT flag set so all inodes created underneath the
+   directory inherit the project ID.
+
+-  Inodes and blocks owned by ID zero are not subject to quota enforcement,
+   only to quota accounting.
+
+-  Extended attributes do not contribute towards the ID’s quota.
+
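+-  To find an ID’s quota information in the file, divide the ID by the number
+   of xfs\_dqblk\_t records (136 bytes each) that fit in one filesystem block;
+   the quotient is the file block and the remainder is the record’s index
+   within that block.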
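A sketch of locating one ID's record, assuming (this is our understanding of the kernel's quota code) that records never straddle filesystem blocks, so each block holds blocksize / 136 records and any leftover bytes at the end of a block are unused. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DQBLK_SIZE 136   /* sizeof(struct xfs_dqblk) */

struct dquot_loc {
    uint64_t fileoff;   /* filesystem block within the quota inode */
    uint32_t byteoff;   /* byte offset of the record in that block */
};

/* Hypothetical helper: where does the record for id live? */
static struct dquot_loc dquot_locate(uint32_t id, uint32_t blocksize)
{
    uint32_t per_block = blocksize / DQBLK_SIZE;
    struct dquot_loc loc;

    assert(per_block > 0);
    loc.fileoff = id / per_block;
    loc.byteoff = (id % per_block) * DQBLK_SIZE;
    return loc;
}
```

With 4096-byte blocks, 30 records fit per block (30 × 136 = 4080 bytes), so ID 30 starts at the beginning of the second block rather than at byte 4080.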
+
+.. figure:: images/76.png
+   :alt: Quota inode layout
+
+   Quota inode layout
+
+Quota information is stored in the data extents of the reserved quota inodes
+as an array of the xfs\_dqblk structures, where there is one array element for
+each ID in the system:
+
+.. code:: c
+
+    struct xfs_disk_dquot {
+        __be16    d_magic;
+        __u8      d_version;
+        __u8      d_flags;
+        __be32    d_id;
+        __be64    d_blk_hardlimit;
+        __be64    d_blk_softlimit;
+        __be64    d_ino_hardlimit;
+        __be64    d_ino_softlimit;
+        __be64    d_bcount;
+        __be64    d_icount;
+        __be32    d_itimer;
+        __be32    d_btimer;
+        __be16    d_iwarns;
+        __be16    d_bwarns;
+        __be32    d_pad0;
+        __be64    d_rtb_hardlimit;
+        __be64    d_rtb_softlimit;
+        __be64    d_rtbcount;
+        __be32    d_rtbtimer;
+        __be16    d_rtbwarns;
+        __be16    d_pad;
+    };
+    struct xfs_dqblk {
+        struct xfs_disk_dquot dd_diskdq;
+        char      dd_fill[4];
+
+        /* version 5 filesystem fields begin here */
+        __be32    dd_crc;
+        __be64    dd_lsn;
+        uuid_t    dd_uuid;
+    };
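The 136-byte figure can be checked by transcribing both structures with fixed-width host types. This sketch assumes a typical ABI with natural alignment, under which the field order above leaves no padding, and substitutes 16 raw bytes for uuid_t; on-disk byte order does not affect the sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-type transcription; on disk every multi-byte field is big-endian. */
struct disk_dquot {
    uint16_t d_magic;
    uint8_t  d_version;
    uint8_t  d_flags;
    uint32_t d_id;
    uint64_t d_blk_hardlimit;
    uint64_t d_blk_softlimit;
    uint64_t d_ino_hardlimit;
    uint64_t d_ino_softlimit;
    uint64_t d_bcount;
    uint64_t d_icount;
    uint32_t d_itimer;
    uint32_t d_btimer;
    uint16_t d_iwarns;
    uint16_t d_bwarns;
    uint32_t d_pad0;
    uint64_t d_rtb_hardlimit;
    uint64_t d_rtb_softlimit;
    uint64_t d_rtbcount;
    uint32_t d_rtbtimer;
    uint16_t d_rtbwarns;
    uint16_t d_pad;
};

struct dqblk {
    struct disk_dquot dd_diskdq;
    char     dd_fill[4];
    /* version 5 filesystem fields */
    uint32_t dd_crc;
    uint64_t dd_lsn;
    uint8_t  dd_uuid[16];
};

/* Compile-time checks of the sizes discussed in the text. */
static_assert(sizeof(struct disk_dquot) == 104, "dquot core size");
static_assert(sizeof(struct dqblk) == 136, "dqblk size");
static_assert(offsetof(struct dqblk, dd_crc) == 108, "crc offset");
```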
+
+**d\_magic**
+Specifies the signature where these two bytes are 0x4451
+(XFS\_DQUOT\_MAGIC), or "DQ" in ASCII.
+
+**d\_version**
+The structure version; currently 1 (XFS\_DQUOT\_VERSION).
+
+**d\_flags**
+Specifies which type of ID the structure applies to:
+
+.. code:: c
+
+    #define XFS_DQ_USER     0x0001
+    #define XFS_DQ_PROJ     0x0002
+    #define XFS_DQ_GROUP    0x0004
+
+**d\_id**
+The ID for the quota structure. This will be a uid, gid or projid based on
+the value of d\_flags.
+
+**d\_blk\_hardlimit**
+The hard limit for the number of filesystem blocks the ID can own. The ID
+will not be able to use more space than this limit. If it is attempted,
+ENOSPC will be returned.
+
+**d\_blk\_softlimit**
+The soft limit for the number of filesystem blocks the ID can own. The ID
+can temporarily use more space than d\_blk\_softlimit, up to
+d\_blk\_hardlimit. If the space is not freed by the time limit specified
+by ID zero’s d\_btimer value, the ID

[PATCH 12/22] docs: add XFS reverse mapping structures to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/allocation_groups.rst  |2 
 .../filesystems/xfs-data-structures/rmapbt.rst |  336 
 2 files changed, 338 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/rmapbt.rst


diff --git a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
index 30d169ab5cc5..6c0ffd3a170b 100644
--- a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
+++ b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
@@ -1379,3 +1379,5 @@ response times that come from metadata operations.
 
 None of the XFS per-AG B+trees are involved with real time files. It is not
 possible for real time files to share data blocks.
+
+.. include:: rmapbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/rmapbt.rst b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
new file mode 100644
index ..eefcee5d4e95
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/rmapbt.rst
@@ -0,0 +1,336 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Reverse-Mapping B+tree
+~~
+
+If the feature is enabled, each allocation group has its own reverse
+block-mapping B+tree, which grows in the free space like the free space
+B+trees. As mentioned in the chapter about
+`reconstruction <#metadata-reconstruction>`__, this data structure is another piece of
+the puzzle necessary to reconstruct the data or attribute fork of a file from
+reverse-mapping records; we can also use it to double-check allocations to
+ensure that we are not accidentally cross-linking blocks, which can cause
+severe damage to the filesystem.
+
+This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_RMAPBT feature
+is enabled. The feature requires a version 5 filesystem.
+
+Each record in the reverse-mapping B+tree has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_rec {
+        __be32 rm_startblock;
+        __be32 rm_blockcount;
+        __be64 rm_owner;
+        __be64 rm_fork:1;
+        __be64 rm_bmbt:1;
+        __be64 rm_unwritten:1;
+        __be64 rm_unused:7;
+        __be64 rm_offset:54;
+    };
+
+**rm\_startblock**
+AG block number of this record.
+
+**rm\_blockcount**
+The length of this extent.
+
+**rm\_owner**
+A 64-bit number describing the owner of this extent. This is typically the
+absolute inode number, but can also correspond to one of the following:
+
+.. list-table::
+   :widths: 28 52
+   :header-rows: 1
+
+   * - Owner value
+     - Description
+
+   * - XFS\_RMAP\_OWN\_NULL
+     - No owner. This should never appear on disk.
+
+   * - XFS\_RMAP\_OWN\_UNKNOWN
+     - Unknown owner; for EFI recovery. This should never appear on disk.
+
+   * - XFS\_RMAP\_OWN\_FS
+     - Allocation group headers.
+
+   * - XFS\_RMAP\_OWN\_LOG
+     - XFS log blocks.
+
+   * - XFS\_RMAP\_OWN\_AG
+     - Per-allocation group B+tree blocks. This means free space B+tree blocks,
+       blocks on the freelist, and reverse-mapping B+tree blocks.
+
+   * - XFS\_RMAP\_OWN\_INOBT
+     - Per-allocation group inode B+tree blocks. This includes free inode
+       B+tree blocks.
+
+   * - XFS\_RMAP\_OWN\_INODES
+     - Inode chunks.
+
+   * - XFS\_RMAP\_OWN\_REFC
+     - Per-allocation group refcount B+tree blocks. These are used for
+       reflink support.
+
+   * - XFS\_RMAP\_OWN\_COW
+     - Blocks that have been reserved for a copy-on-write operation that has
+       not completed.
+
+Table: Special owner values
+
+**rm\_fork**
+If rm\_owner describes an inode, this is 1 if this record is for the
+attribute fork.
+
+**rm\_bmbt**
+If rm\_owner describes an inode, this can be 1 to signify that this record
+is for a block map B+tree block. In this case, rm\_offset has no meaning.
+
+**rm\_unwritten**
+A flag indicating that the extent is unwritten. This corresponds to the
+flag in the `extent record <#data-extents>`__ format which means
+XFS\_EXT\_UNWRITTEN.
+
+**rm\_offset**
+The 54-bit logical file block offset, if rm\_owner describes an inode.
+Meaningless otherwise.
+
+**Note**
+
+The single-bit flag values rm\_unwritten, rm\_fork, and rm\_bmbt are
+packed into the larger fields in the C structure definition.
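A sketch of that packing, assuming the bit positions implied by the structure above once the 64-bit word is in host order: the attribute-fork flag in bit 63, the B+tree-block flag in bit 62, the unwritten flag in bit 61, and the file offset in the low 54 bits. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RMAP_OFF_ATTR_FORK  (UINT64_C(1) << 63)
#define RMAP_OFF_BMBT_BLOCK (UINT64_C(1) << 62)
#define RMAP_OFF_UNWRITTEN  (UINT64_C(1) << 61)
#define RMAP_OFF_MASK       ((UINT64_C(1) << 54) - 1)

static uint64_t rmap_pack_offset(uint64_t offset, bool attr_fork,
                                 bool bmbt, bool unwritten)
{
    assert(offset <= RMAP_OFF_MASK);   /* only 54 bits are available */
    return offset |
           (attr_fork ? RMAP_OFF_ATTR_FORK : 0) |
           (bmbt ? RMAP_OFF_BMBT_BLOCK : 0) |
           (unwritten ? RMAP_OFF_UNWRITTEN : 0);
}

/* The unused bits 54-60 are expected to be zero. */
static uint64_t rmap_offset(uint64_t packed)
{
    return packed & RMAP_OFF_MASK;
}

static bool rmap_is_attr_fork(uint64_t packed)
{
    return (packed & RMAP_OFF_ATTR_FORK) != 0;
}

static bool rmap_is_unwritten(uint64_t packed)
{
    return (packed & RMAP_OFF_UNWRITTEN) != 0;
}
```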
+
+The key has the following structure:
+
+.. code:: c
+
+    struct xfs_rmap_key {
+        __be32 rm_startblock;
+        __be64 rm_owner;
+        __be64 rm_fork:1;
+        __be64 rm_bmbt:1;
+        __be64 rm_reserved:1;
+        __be64 rm_unused:7;
+        __be64

[PATCH 11/22] docs: add XFS allocation group metadata to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/allocation_groups.rst  | 1381 
 .../filesystems/xfs-data-structures/globals.rst|1 
 2 files changed, 1382 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/allocation_groups.rst


diff --git a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
new file mode 100644
index ..30d169ab5cc5
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
@@ -0,0 +1,1381 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Allocation Groups
+-----------------
+
+As mentioned earlier, XFS filesystems are divided into a number of equally
+sized chunks called Allocation Groups. Each AG can almost be thought of as an
+individual filesystem that maintains its own space usage. Each AG can be up to
+one terabyte in size (512 bytes × 2\ :sup:`31`), regardless of the underlying
+device’s sector size.
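Because each AG is addressed independently, an absolute filesystem block number is built from an AG number and an AG-relative block number; xfs_db prints such pairs as agno/agbno, as in "block 72256 (0/72256)" in the refcount example. A sketch, assuming the AG-relative part occupies the low sb_agblklog bits (a superblock field) of the block number; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compose an absolute block number from (AG number, AG block). */
static uint64_t agb_to_fsb(uint32_t agno, uint32_t agbno, uint8_t agblklog)
{
    assert(agblklog < 32 && agbno < (UINT32_C(1) << agblklog));
    return ((uint64_t)agno << agblklog) | agbno;
}

static uint32_t fsb_to_agno(uint64_t fsbno, uint8_t agblklog)
{
    return (uint32_t)(fsbno >> agblklog);
}

static uint32_t fsb_to_agbno(uint64_t fsbno, uint8_t agblklog)
{
    return (uint32_t)(fsbno & ((UINT64_C(1) << agblklog) - 1));
}
```

As we understand it, sb_agblklog is log2 of the AG size rounded up, so when the AG size is not a power of two some absolute block numbers are never used.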
+
+Each AG has the following characteristics:
+
+-  A superblock describing overall filesystem information
+
+-  Free space management
+
+-  Inode allocation and tracking
+
+-  Reverse block-mapping index (optional)
+
+-  Data block reference count index (optional)
+
+Having multiple AGs allows XFS to handle most operations in parallel without
+degrading performance as the number of concurrent accesses increases.
+
+The only global information maintained by the first AG (primary) is free space
+across the filesystem and total inode counts. If the
+XFS\_SB\_VERSION2\_LAZYSBCOUNTBIT flag is set in the superblock, these are
+only updated on-disk when the filesystem is cleanly unmounted (umount or
+shutdown).
+
+Immediately after a mkfs.xfs, the primary AG has the following disk layout;
+the subsequent AGs do not have any inodes allocated:
+
+.. figure:: images/6.png
+   :alt: Allocation group layout
+
+   Allocation group layout
+
+Each of these structures is described in the following sections.
+
+Superblocks
+~~~~~~~~~~~
+
+Each AG starts with a superblock. The first one, in AG 0, is the primary
+superblock which stores aggregate AG information. Secondary superblocks are
+only used by xfs\_repair when the primary superblock has been corrupted. A
+superblock is one sector in length.
+
+The superblock is defined by the following structure. The description of each
+field follows.
+
+.. code:: c
+
+    struct xfs_sb
+    {
+        __uint32_t      sb_magicnum;
+        __uint32_t      sb_blocksize;
+        xfs_rfsblock_t  sb_dblocks;
+        xfs_rfsblock_t  sb_rblocks;
+        xfs_rtblock_t   sb_rextents;
+        uuid_t          sb_uuid;
+        xfs_fsblock_t   sb_logstart;
+        xfs_ino_t       sb_rootino;
+        xfs_ino_t       sb_rbmino;
+        xfs_ino_t       sb_rsumino;
+        xfs_agblock_t   sb_rextsize;
+        xfs_agblock_t   sb_agblocks;
+        xfs_agnumber_t  sb_agcount;
+        xfs_extlen_t    sb_rbmblocks;
+        xfs_extlen_t    sb_logblocks;
+        __uint16_t      sb_versionnum;
+        __uint16_t      sb_sectsize;
+        __uint16_t      sb_inodesize;
+        __uint16_t      sb_inopblock;
+        char            sb_fname[12];
+        __uint8_t       sb_blocklog;
+        __uint8_t       sb_sectlog;
+        __uint8_t       sb_inodelog;
+        __uint8_t       sb_inopblog;
+        __uint8_t       sb_agblklog;
+        __uint8_t       sb_rextslog;
+        __uint8_t       sb_inprogress;
+        __uint8_t       sb_imax_pct;
+        __uint64_t      sb_icount;
+        __uint64_t      sb_ifree;
+        __uint64_t      sb_fdblocks;
+        __uint64_t      sb_frextents;
+        xfs_ino_t       sb_uquotino;
+        xfs_ino_t       sb_gquotino;
+        __uint16_t      sb_qflags;
+        __uint8_t       sb_flags;
+        __uint8_t       sb_shared_vn;
+        xfs_extlen_t    sb_inoalignmt;
+        __uint32_t      sb_unit;
+        __uint32_t      sb_width;
+        __uint8_t       sb_dirblklog;
+        __uint8_t       sb_logsectlog;
+        __uint16_t      sb_logsectsize;
+        __uint32_t      sb_logsunit;
+        __uint32_t      sb_features2;
+        __uint32_t      sb_bad_features2;
+
+        /* version 5 superblock fields start here */
+        __uint32_t      sb_features_compat;
+        __uint32_t      sb_features_ro_compat;
+        __uint32_t      sb_features_incompat;
+        __uint32_t      sb_features_log_incompat;
+
+        __uint32_t      sb_crc;
+        xfs_extlen_t    sb_spino_align;
+
+        xfs_ino_t       sb_pquotino;
+        xfs_lsn_t       sb_lsn;
+        uuid_t          sb_meta_uuid;
+        xfs_ino_t       sb_rrmapino;
+    };
+
+**sb\_magicnum**
+Identifies the filesystem. Its value is XFS\_SB\_MAGIC "XFSB"
+(0x58465342).
+
+**sb\_blocksize**
+The size of a basic unit of space allocation in bytes. Typically, this is
+4096 (4KB) but can range from 512 to 65536 byte

[PATCH 13/22] docs: add XFS refcount btree structure to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/allocation_groups.rst  |1 
 .../filesystems/xfs-data-structures/refcountbt.rst |  154 
 2 files changed, 155 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/refcountbt.rst


diff --git a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
index 6c0ffd3a170b..76c6ddcd02ac 100644
--- a/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
+++ b/Documentation/filesystems/xfs-data-structures/allocation_groups.rst
@@ -1381,3 +1381,4 @@ None of the XFS per-AG B+trees are involved with real time files. It is not
 possible for real time files to share data blocks.
 
 .. include:: rmapbt.rst
+.. include:: refcountbt.rst
diff --git a/Documentation/filesystems/xfs-data-structures/refcountbt.rst b/Documentation/filesystems/xfs-data-structures/refcountbt.rst
new file mode 100644
index ..0f2b818959df
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/refcountbt.rst
@@ -0,0 +1,154 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Reference Count B+tree
+~~~~~~~~~~~~~~~~~~~~~~
+
+To support the sharing of file data blocks (reflink), each allocation group
+has its own reference count B+tree, which grows in the allocated space like
+the inode B+trees. This data could be gleaned by performing an interval query
+of the reverse-mapping B+tree, but doing so would come at a huge performance
+penalty. Therefore, this data structure is a cache of computable information.
+
+This B+tree is only present if the XFS\_SB\_FEAT\_RO\_COMPAT\_REFLINK feature
+is enabled. The feature requires a version 5 filesystem.
+
+Each record in the reference count B+tree has the following structure:
+
+.. code:: c
+
+    struct xfs_refcount_rec {
+        __be32 rc_startblock;
+        __be32 rc_blockcount;
+        __be32 rc_refcount;
+    };
+
+**rc\_startblock**
+AG block number of this record. The high bit is set for all records
+referring to an extent that is being used to stage a copy on write
+operation. This reduces recovery time during mount operations. The
+reference count of these staging extents must be exactly 1.
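A sketch of separating the copy-on-write flag from the block number, assuming the "high bit" is bit 31 of the 32-bit rc_startblock once converted to host order. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REFC_COW_FLAG (UINT32_C(1) << 31)

static bool refc_is_cow(uint32_t rc_startblock)
{
    return (rc_startblock & REFC_COW_FLAG) != 0;
}

/* AG block number with the staging flag stripped off. */
static uint32_t refc_agbno(uint32_t rc_startblock)
{
    return rc_startblock & ~REFC_COW_FLAG;
}

/* A staging (CoW) record must have a reference count of exactly 1. */
static bool refc_rec_ok(uint32_t rc_startblock, uint32_t rc_refcount)
{
    return !refc_is_cow(rc_startblock) || rc_refcount == 1;
}
```

Setting the top bit also makes every staging record sort after all ordinary records in the same AG.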
+
+**rc\_blockcount**
+The length of this extent.
+
+**rc\_refcount**
+Number of mappings of this filesystem extent.
+
+Node pointers are AG-relative block pointers. The node keys have the
+following structure:
+
+.. code:: c
+
+    struct xfs_refcount_key {
+        __be32 rc_startblock;
+    };
+
+-  As the reference counting is AG relative, all the block numbers are only
+   32 bits.
+
+-  The bb\_magic value is "R3FC" (0x52334643).
+
+-  The xfs\_btree\_sblock\_t header is used for intermediate B+tree nodes as
+   well as the leaves.
+
+xfs\_db refcntbt Example
+
+
+For this example, an XFS filesystem was populated with a root filesystem and a
+deduplication program was run to create shared blocks:
+
+::
+
+    xfs_db> agf 0
+    xfs_db> addr refcntroot
+    xfs_db> p
+    magic = 0x52334643
+    level = 1
+    numrecs = 6
+    leftsib = null
+    rightsib = null
+    bno = 36892
+    lsn = 0x24ec2
+    uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+    owner = 0
+    crc = 0x75f35128 (correct)
+    keys[1-6] = [startblock] 1:[14] 2:[65633] 3:[65780] 4:[94571] 5:[117201] 6:[152442]
+    ptrs[1-6] = 1:7 2:25836 3:25835 4:18447 5:18445 6:18449
+    xfs_db> addr ptrs[3]
+    xfs_db> p
+    magic = 0x52334643
+    level = 0
+    numrecs = 80
+    leftsib = 25836
+    rightsib = 18447
+    bno = 51670
+    lsn = 0x24ec2
+    uuid = f1f89746-e00b-49c9-96b3-ecef0f2f14ae
+    owner = 0
+    crc = 0xc3962813 (correct)
+    recs[1-80] = [startblock,blockcount,refcount,cowflag]
+    1:[65780,1,2,0] 2:[65781,1,3,0] 3:[65785,2,2,0] 4:[66640,1,2,0]
+    5:[69602,4,2,0] 6:[72256,16,2,0] 7:[72871,4,2,0] 8:[72879,20,2,0]
+    9:[73395,4,2,0] 10:[75063,4,2,0] 11:[79093,4,2,0] 12:[86344,16,2,0]
+    ...
+    80:[35235,10,1,1]
+
+Notice record 80. The copy-on-write flag is set and the reference count is 1,
+which indicates that blocks 35,235 to 35,244 are being used to stage a
+copy-on-write operation. The "cowflag" field is the high bit of
+rc\_startblock.
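To see how such records answer "how many mappings share this block?", here is a sketch of a lookup over an in-memory, sorted array of records, using the first leaf records from the dump above. It assumes, as we understand the design, that a block with no covering record is not shared; CoW staging records are ignored, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct refc_rec {
    uint32_t startblock;   /* AG block, CoW flag already stripped */
    uint32_t blockcount;
    uint32_t refcount;
};

/*
 * Return the reference count recorded for agbno, or 0 if no record
 * covers it (the block is not shared).  recs must be sorted by
 * startblock and must not overlap.
 */
static uint32_t refc_lookup(const struct refc_rec *recs, size_t n,
                            uint32_t agbno)
{
    size_t lo = 0, hi = n;

    /* Find the first record whose startblock is greater than agbno. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (recs[mid].startblock <= agbno)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;

    /* The only candidate is the record just before that one. */
    const struct refc_rec *r = &recs[lo - 1];
    assert(r->startblock <= agbno);
    return agbno - r->startblock < r->blockcount ? r->refcount : 0;
}
```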
+
+Record 6 in the reference count B+tree for AG 0 indicates that the AG extent
+starting at block 72,256 and running for 16 blocks has a reference count of 2.
+This means that there are two files sharing the block:
+
+::
+
+    xfs_db> blockget -n
+    xfs_db> fsblock 72256
+    xfs_db> blockuse
+    block 72256 (0/72256) type rldata inode 25169197
+
+The blockuse type changes to "rldata" to indicate that the block is shared
+data. Unfortunately, blockuse only tells us about one block owner. If we
+happen to have enabled the reverse-mapp

[PATCH 10/22] docs: add XFS dir/attr btree structure to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/dabtrees.rst   |  221 
 .../filesystems/xfs-data-structures/globals.rst|1 
 2 files changed, 222 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/dabtrees.rst


diff --git a/Documentation/filesystems/xfs-data-structures/dabtrees.rst b/Documentation/filesystems/xfs-data-structures/dabtrees.rst
new file mode 100644
index ..9daac6295941
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/dabtrees.rst
@@ -0,0 +1,221 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Variable Length Record B+trees
+------------------------------
+
+Directories and extended attributes are implemented as a simple key-value
+record store inside the blocks pointed to by the data or attribute fork of a
+file. Blocks referenced by either data structure are block offsets of an inode
+fork, not physical blocks.
+
+Directory and attribute data are stored as a linear array of variable-length
+records in the low blocks of a fork. Both data types share the property that
+record keys and record values are both arbitrary and unique sequences of
+bytes. See the respective sections about `directories <#directories>`__ or
+`attributes <#extended-attributes>`__ for more information about the exact
+record formats.
+
+The dir/attr b+tree (or "dabtree"), if present, computes a hash of the record
+key to produce the b+tree key, and b+tree keys are used to index the fork
+block in which the record may be found. Unlike the fixed-length b+trees, the
+variable length b+trees can index the same key multiple times. B+tree
+key/pointer pairs and records both take this format:
+
+::
+
+    +---------+--------------+
+    | hashval | before_block |
+    +---------+--------------+
+
+The "before block" is the block offset in the inode fork of the block in which
+we can find the record whose hashed key is "hashval". The hash function is as
+follows:
+
+.. code:: c
+
+    #define rol32(x,y) (((x) << (y)) | ((x) >> (32 - (y))))
+
+    xfs_dahash_t
+    xfs_da_hashname(const uint8_t *name, int namelen)
+    {
+            xfs_dahash_t hash;
+
+            /*
+             * Do four characters at a time as long as we can.
+             */
+            for (hash = 0; namelen >= 4; namelen -= 4, name += 4)
+                    hash = (name[0] << 21) ^ (name[1] << 14) ^ (name[2] << 7) ^
+                           (name[3] << 0) ^ rol32(hash, 7 * 4);
+
+            /*
+             * Now do the rest of the characters.
+             */
+            switch (namelen) {
+            case 3:
+                    return (name[0] << 14) ^ (name[1] << 7) ^ (name[2] << 0) ^
+                           rol32(hash, 7 * 3);
+            case 2:
+                    return (name[0] << 7) ^ (name[1] << 0) ^ rol32(hash, 7 * 2);
+            case 1:
+                    return (name[0] << 0) ^ rol32(hash, 7 * 1);
+            default: /* case 0: */
+                    return hash;
+            }
+    }
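Because the dabtree can index the same hash value more than once, a lookup in a node block needs the leftmost entry whose hashval is at or above the search hash, not just any entry that matches. The C sketch below shows that search over an in-memory copy of a node's entries. ``struct dab_entry`` and ``dab_node_find()`` are hypothetical names, and the sketch assumes the entries were already converted from big-endian and that each hashval is the largest hash reachable through its before block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU-endian copy of one node entry (not the on-disk struct). */
struct dab_entry {
	uint32_t hashval;	/* assumed: largest hash reachable via 'before' */
	uint32_t before;	/* fork block offset of the child block */
};

/*
 * Return the index of the leftmost entry whose hashval is >= hash, or n
 * if every entry is smaller.  Because duplicate hash values are allowed,
 * the search keeps narrowing toward the lowest qualifying index instead
 * of stopping at the first equal entry it happens to probe.
 */
static size_t dab_node_find(const struct dab_entry *ents, size_t n,
			    uint32_t hash)
{
	size_t lo = 0, hi = n;	/* the answer always lies in [lo, hi] */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ents[mid].hashval < hash)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

If the wanted record is not in the block named by ``ents[i].before``, a caller would move on to the next entry carrying the same hashval, since records with equal hashes can span several blocks.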
+
+.. _directory-attribute-block-header:
+
+Block Headers
+~~~~~~~~~~~~~
+
+-  Tree nodes, leaf and node `directories <#directories>`__, and leaf and node
+   `extended attributes <#extended-attributes>`__ use the xfs\_da\_blkinfo\_t
+   filesystem block header. The structure appears as follows:
+
+.. code:: c
+
+    typedef struct xfs_da_blkinfo {
+            __be32 forw;
+            __be32 back;
+            __be16 magic;
+            __be16 pad;
+    } xfs_da_blkinfo_t;
+
+**forw**
+    Logical block offset of the previous B+tree block at this level.
+
+**back**
+    Logical block offset of the next B+tree block at this level.
+
+**magic**
+    Magic number for this directory/attribute block.
+
+**pad**
+    Padding to maintain alignment.
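The fields above sit back to back with no padding, so the 16-bit magic number lands at byte offset 8. Here is a minimal C sketch of decoding this header from a raw, big-endian block buffer; ``struct da_blkinfo_cpu``, ``get_be32()``, ``get_be16()``, and ``da_blkinfo_decode()`` are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

/* Hypothetical CPU-endian view of the 12-byte xfs_da_blkinfo header. */
struct da_blkinfo_cpu {
	uint32_t forw;
	uint32_t back;
	uint16_t magic;
	uint16_t pad;
};

/* Read a big-endian 32-bit value from an unaligned byte buffer. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 16-bit value from an unaligned byte buffer. */
static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode the header found at the start of a dir/attr block buffer. */
static struct da_blkinfo_cpu da_blkinfo_decode(const uint8_t *buf)
{
	struct da_blkinfo_cpu info;

	info.forw = get_be32(buf + 0);	/* offsets follow the field order */
	info.back = get_be32(buf + 4);
	info.magic = get_be16(buf + 8);	/* 16-bit magic lives at offset 8 */
	info.pad = get_be16(buf + 10);
	return info;
}
```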
+
+-  On a v5 filesystem, these blocks use the struct xfs\_da3\_blkinfo
+   filesystem block header instead. This header is used in the same place as
+   xfs\_da\_blkinfo\_t:
+
+.. code:: c
+
+    struct xfs_da3_blkinfo {
+            /* these values are inside xfs_da_blkinfo */
+            __be32 forw;
+            __be32 back;
+            __be16 magic;
+            __be16 pad;
+
+            __be32 crc;
+            __be64 blkno;
+            __be64 lsn;
+            uuid_t uuid;
+            __be64 owner;
+    };
+
+**forw**
+    Logical block offset of the previous B+tree block at this level.
+
+**back**
+    Logical block offset of the next B+tree block at this level.
+
+**magic**
+    Magic number for this directory/attribute block.
+
+**pad**
+    Padding to maintain alignment.
+
+**crc**
+    Checksum of the directory/attribute block.
+
+**blkno**
+    Block number of this directory/attribute block.
+**lsn**
+Log seque

[PATCH 08/22] docs: add XFS testing chapter to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |1 +
 .../filesystems/xfs-data-structures/testing.rst|   25 
 2 files changed, 26 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/testing.rst


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst 
b/Documentation/filesystems/xfs-data-structures/overview.rst
index 23eb71d65c93..d6711dc653d8 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -49,3 +49,4 @@ latency.
 .. include:: reconstruction.rst
 .. include:: common_types.rst
 .. include:: magic.rst
+.. include:: testing.rst
diff --git a/Documentation/filesystems/xfs-data-structures/testing.rst 
b/Documentation/filesystems/xfs-data-structures/testing.rst
new file mode 100644
index ..3d3386854408
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/testing.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Testing Filesystem Changes
+--------------------------
+
+People put a lot of trust in filesystems to preserve their data in a reliable
+fashion. To that end, it is very important that users and developers have
+access to a suite of regression tests that can be used to prove correct
+operation of any given filesystem code, or to analyze failures to fix problems
+found in the code. The XFS regression test suite, xfstests, is hosted at
+``git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git``. Most tests apply to
+filesystems in general, but the suite also contains tests for features
+specific to each filesystem.
+
+When fixing bugs, it is important to provide a testcase exposing the bug so
+that the developers can avoid a future recurrence of the regression.
+Furthermore, if you’re developing a new user-visible feature for XFS, please
+help the rest of the development community to sustain and maintain the whole
+codebase by providing generous test coverage to check its behavior.
+
+When altering, adding, or removing an on-disk data structure, please remember
+both to update the in-kernel structure size checks in xfs\_ondisk.h and to
+ensure that your changes are reflected in xfstests test xfs/122. These regression
+tests enable us to detect compiler bugs, alignment problems, and anything else
+that might result in the creation of incompatible filesystem images.
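As an illustration of the kind of compile-time size check kept in xfs\_ondisk.h, the sketch below uses C11 ``_Static_assert`` on a hypothetical userspace mirror of struct xfs\_da3\_blkinfo, with fixed-width integers standing in for the kernel's ``__be`` types. The macro name is made up for this example. The expected size of 56 bytes is the sum of the field sizes (4 + 4 + 2 + 2 + 4 + 8 + 8 + 16 + 8), and every field is naturally aligned, so no padding is expected on common ABIs.

```c
#include <stdint.h>

/*
 * Hypothetical userspace mirror of struct xfs_da3_blkinfo.  This is not
 * the kernel header; it only illustrates the style of check.
 */
struct mirror_da3_blkinfo {
	uint32_t forw;
	uint32_t back;
	uint16_t magic;
	uint16_t pad;
	uint32_t crc;
	uint64_t blkno;
	uint64_t lsn;
	uint8_t  uuid[16];
	uint64_t owner;
};

/* Fail the build if the layout drifts from the expected on-disk size. */
#define CHECK_STRUCT_SIZE(type, size) \
	_Static_assert(sizeof(type) == (size), "size of " #type " changed")

CHECK_STRUCT_SIZE(struct mirror_da3_blkinfo, 56);
```

Catching the mismatch at build time means a compiler or ABI quirk that changes the layout breaks the build instead of silently producing incompatible filesystem images.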



[PATCH 06/22] docs: add XFS online repair chapter to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |1 
 .../xfs-data-structures/reconstruction.rst |   68 
 2 files changed, 69 insertions(+)
 create mode 100644 
Documentation/filesystems/xfs-data-structures/reconstruction.rst


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst 
b/Documentation/filesystems/xfs-data-structures/overview.rst
index d8d668ec6097..b1b3f711638b 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -46,3 +46,4 @@ latency.
 .. include:: self_describing_metadata.rst
 .. include:: delayed_logging.rst
 .. include:: reflink.rst
+.. include:: reconstruction.rst
diff --git a/Documentation/filesystems/xfs-data-structures/reconstruction.rst 
b/Documentation/filesystems/xfs-data-structures/reconstruction.rst
new file mode 100644
index ..10a7a728c50c
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/reconstruction.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Metadata Reconstruction
+-----------------------
+
+**Note**
+
+This is a theoretical discussion of how reconstruction could work; none of
+this is implemented as of 2018.
+
+A simple UNIX filesystem can be thought of in terms of a directed acyclic
+graph. To a first approximation, there exists a root directory node, which
+points to other nodes. Those other nodes can themselves be directories or they
+can be files. Each file, in turn, points to data blocks.
+
+XFS adds a few more details to this picture:
+
+-  The real root(s) of an XFS filesystem are the allocation group headers
+   (superblock, AGF, AGI, AGFL);
+
+-  Each allocation group’s headers point to various per-AG B+trees (free
+   space, inode, free inodes, free list, etc.);
+
+-  The free space B+trees point to unused extents;
+
+-  The inode B+trees point to blocks containing inode chunks;
+
+-  All superblocks point to the root directory and the log;
+
+-  Hardlinks mean that multiple directories can point to a single file node;
+
+-  File data block pointers are indexed by file offset;
+
+-  Files and directories can have a second collection of pointers to data
+   blocks which contain extended attributes;
+
+-  Large directories require multiple data blocks to store all the
+   subpointers;
+
+-  Still larger directories use high-offset data blocks to store a B+tree of
+   hashes to directory entries;
+
+-  Large extended attribute forks similarly use high-offset data blocks to
+   store a B+tree of hashes to attribute keys; and
+
+-  Symbolic links can point to data blocks.
+
+The beauty of this massive graph structure is that under normal circumstances,
+everything known to the filesystem is discoverable (access controls
+notwithstanding) from the root. The major weakness of this structure, of course,
+is that breaking an edge in the graph can render entire subtrees inaccessible.
+xfs\_repair “recovers” from broken directories by scanning for unlinked inodes
+and connecting them to /lost+found, but this isn’t sufficiently general to
+recover from breaks in other parts of the graph structure. Wouldn’t it be
+useful to have back pointers as a secondary data structure? The current repair
+strategy is to reconstruct whatever can be rebuilt, but to scrap anything that
+doesn’t check out.
+
+The `reverse-mapping B+tree <#reverse-mapping-b-tree>`__ fills in part of the
+puzzle. Since it contains copies of every entry in each inode’s data and
+attribute forks, we can fix a corrupted block map with these records.
+Furthermore, if the inode B+trees become corrupt, it is possible to visit all
+inode chunks using the reverse-mapping data. Should XFS ever gain the ability
+to store parent directory information in each inode, it also becomes possible
+to resurrect damaged directory trees, which should reduce the complaints about
+inodes ending up in /lost+found. Everything else in the per-AG primary
+metadata can already be reconstructed via xfs\_repair. Hopefully,
+reconstruction will not turn out to be a fool’s errand.
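To make the block-map repair idea concrete, here is a C sketch that rebuilds one inode's data fork mapping from reverse-mapping records: keep the records owned by that inode's data fork, then sort them by file offset. The record layouts and ``rebuild_data_fork()`` are hypothetical simplifications. Real reverse-mapping records carry additional flags and state, and this sketch does not describe any existing repair code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified reverse-mapping record (not the on-disk format). */
struct rmap_rec {
	uint64_t startblock;	/* physical block where the extent starts */
	uint32_t blockcount;	/* length of the extent in blocks */
	uint64_t owner;		/* inode number that owns the extent */
	uint64_t offset;	/* logical file offset of the extent */
	int	 attr_fork;	/* nonzero if this maps the attribute fork */
};

/* Simplified block-map entry: file offset -> physical block. */
struct bmap_rec {
	uint64_t offset;
	uint64_t startblock;
	uint32_t blockcount;
};

/* qsort comparator ordering block-map entries by file offset. */
static int cmp_bmap_offset(const void *a, const void *b)
{
	const struct bmap_rec *x = a, *y = b;

	return (x->offset > y->offset) - (x->offset < y->offset);
}

/*
 * Rebuild the data-fork mapping of inode 'ino' from rmap records.
 * 'out' must have room for 'n' entries; returns how many were written.
 */
static size_t rebuild_data_fork(const struct rmap_rec *rmaps, size_t n,
				uint64_t ino, struct bmap_rec *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (rmaps[i].owner != ino || rmaps[i].attr_fork)
			continue;	/* someone else's extent, or the attr fork */
		out[count].offset = rmaps[i].offset;
		out[count].startblock = rmaps[i].startblock;
		out[count].blockcount = rmaps[i].blockcount;
		count++;
	}
	qsort(out, count, sizeof(out[0]), cmp_bmap_offset);
	return count;
}
```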



[PATCH 03/22] docs: add XFS self-describing metadata integrity doc to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |2 
 .../self_describing_metadata.rst   |  402 
 .../filesystems/xfs-self-describing-metadata.txt   |  350 -
 3 files changed, 404 insertions(+), 350 deletions(-)
 create mode 100644 
Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
 delete mode 100644 Documentation/filesystems/xfs-self-describing-metadata.txt


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst 
b/Documentation/filesystems/xfs-data-structures/overview.rst
index 43b48f30f7e8..8b3de9abcf39 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -42,3 +42,5 @@ filesystem operations can be carried out atomically in the 
case of a crash.
 Furthermore, there is the concept of a real-time device wherein allocations
 are tracked more simply and in larger chunks to reduce jitter in allocation
 latency.
+
+.. include:: self_describing_metadata.rst
diff --git 
a/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst 
b/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
new file mode 100644
index ..f9d41c76e1d5
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
@@ -0,0 +1,402 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Metadata Integrity
+------------------
+
+Introduction
+~~~~~~~~~~~~
+
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. Scalability of
+the structures and indexes on disk and the algorithms for iterating them are
+adequate for supporting PB scale filesystems with billions of inodes; however,
+it is this very scalability that causes the verification problem.
+
+Almost all metadata on XFS is dynamically allocated. The only fixed-location
+metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+
+For example, it is entirely possible to manually use xfs\_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced
+writes weren’t the ultimate cause of a corruption event. It may take a few
+hours to a few days to perform such forensic analysis, so at this scale
+root cause analysis is entirely possible.
+
+However, if we scale the filesystem up to 1PB, we now have 10x as much
+metadata to analyse and so that analysis blows out towards weeks/months of
+forensic work. Most of the analysis work is slow and tedious, so as the amount
+of analysis goes up, the more likely it is that the cause will be lost in the noise.
+Hence the primary concern for supporting PB scale filesystems is minimising
+the time and effort required for basic forensic analysis of the filesystem
+structure.
+
+Therefore, the version 5 disk format introduced larger headers for all
+metadata types, which enable the filesystem to check information being read
+from the disk more rigorously. Metadata integrity fields now include:
+
+-  **Magic** numbers, to classify all types of metadata. This is unchanged
+   from v4.
+
+-  A copy of the filesystem **UUID**, to confirm that a given disk block is
+   connected to the superblock.
+
+-  The **owner**, to avoid accessing a piece of metadata which belongs to some
+   other part of the filesystem.
+
+-  The filesystem **block number**, to detect misplaced writes.
+
+-  The **log serial number** of the last write to this block, to avoid
+   replaying obsolete log entries.
+
+-  A CRC32c **checksum** of the entire block, to detect minor corruption.
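The identity checks in that list can be sketched in C as a single function that compares what a block claims about itself with what the reader expected to find. Everything here (``struct v5_hdr``, ``enum verify_result``, ``verify_v5_hdr()``) is hypothetical, the fields are assumed to be already converted from big-endian, and the CRC32c and LSN checks are left out to keep the example short.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU-endian copy of the self-describing fields of a block. */
struct v5_hdr {
	uint32_t magic;
	uint8_t  uuid[16];
	uint64_t blkno;		/* where the block says it lives */
	uint64_t owner;		/* what the block says owns it */
};

enum verify_result {
	VERIFY_OK = 0,
	VERIFY_BAD_MAGIC,	/* not the type of metadata we expected */
	VERIFY_BAD_UUID,	/* block belongs to some other filesystem */
	VERIFY_BAD_BLKNO,	/* misplaced write: block is at the wrong address */
	VERIFY_BAD_OWNER,	/* block belongs to another part of the fs */
};

/* Check identity fields against what the reader expected; CRC omitted. */
static enum verify_result verify_v5_hdr(const struct v5_hdr *h,
					uint32_t want_magic,
					const uint8_t fs_uuid[16],
					uint64_t read_blkno,
					uint64_t want_owner)
{
	if (h->magic != want_magic)
		return VERIFY_BAD_MAGIC;
	if (memcmp(h->uuid, fs_uuid, 16) != 0)
		return VERIFY_BAD_UUID;
	if (h->blkno != read_blkno)
		return VERIFY_BAD_BLKNO;
	if (h->owner != want_owner)
		return VERIFY_BAD_OWNER;
	return VERIFY_OK;
}
```

Checking the magic and UUID first means a block of the wrong type, or from a different filesystem, is rejected before any of its other fields are trusted.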
+
+Metadata integrity coverage has been extended to all metadata blocks in the
+filesystem, with the following notes:
+
+-  Inodes can have multiple "owners" in the directory tree; therefore the
+   record contains the inode number instead of an owner or a block number.
+
+-  Superblocks have no owners.
+
+-  The disk quota file has no owner or block numbers.
+
+-  Metadata owned by files lists the inode number as the owner.
+
+-  Per-AG data and B+tree blocks list the AG number as the owner.
+
+-  Per-AG header sectors don’t list owners or block numbers, since they have
+   fixed locations.
+
+-  Remote attribute blocks are not logged and therefore the LSN must be -1.
+
+This functionality enables XFS to decide that a block's contents are so
+unexpected that it should stop immediately. Unfortunately checksums do 

[PATCH 04/22] docs: add XFS delayed logging design doc to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/delayed_logging.rst|  828 
 .../filesystems/xfs-data-structures/overview.rst   |1 
 .../filesystems/xfs-delayed-logging-design.txt |  793 ---
 3 files changed, 829 insertions(+), 793 deletions(-)
 create mode 100644 
Documentation/filesystems/xfs-data-structures/delayed_logging.rst
 delete mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt


diff --git a/Documentation/filesystems/xfs-data-structures/delayed_logging.rst 
b/Documentation/filesystems/xfs-data-structures/delayed_logging.rst
new file mode 100644
index ..a4ae343e7556
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/delayed_logging.rst
@@ -0,0 +1,828 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Delayed Logging
+---------------
+
+Introduction to Re-logging in XFS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+XFS logging is a combination of logical and physical logging. Some objects,
+such as inodes and dquots, are logged in logical format where the details
+logged are made up of the changes to in-core structures rather than on-disk
+structures. Other objects - typically buffers - have their physical changes
+logged. The reason for these differences is to reduce the amount of log space
+required for objects that are frequently logged. Some parts of inodes are more
+frequently logged than others, and inodes are typically more frequently logged
+than any other object (except maybe the superblock buffer) so keeping the
+amount of metadata logged low is of prime importance.
+
+The reason that this is such a concern is that XFS allows multiple separate
+modifications to a single object to be carried in the log at any given time.
+This allows the log to avoid needing to flush each change to disk before
+recording a new change to the object. XFS does this via a method called
+"re-logging". Conceptually, this is quite simple - all it requires is that any
+new change to the object is recorded with a **new copy** of all the existing
+changes in the new transaction that is written to the log.
+
+That is, if we have a sequence of changes A through to F, and the object was
+written to disk after change D, we would see in the log the following series
+of transactions, their contents and the log sequence number (LSN) of the
+transaction:
+
+::
+
+    Transaction     Contents     LSN
+         A              A        X
+         B             A+B       X+n
+         C            A+B+C      X+n+m
+         D           A+B+C+D     X+n+m+o
+          <object written to disk>
+         E              E        Y (> X+n+m+o)
+         F             E+F       Y+p
+
+In other words, each time an object is relogged, the new transaction contains
+the aggregation of all the previous changes currently held only in the log.
+
+This relogging technique also allows objects to be moved forward in the log so
+that an object being relogged does not prevent the tail of the log from ever
+moving forward. This can be seen in the table above by the changing
+(increasing) LSN of each subsequent transaction - the LSN is effectively a
+direct encoding of the location in the log of the transaction.
+
+This relogging is also used to implement long-running, multiple-commit
+transactions. These transactions are known as rolling transactions, and require
+a special log reservation known as a permanent transaction reservation. A
+typical example of a rolling transaction is the removal of extents from an
+inode which can only be done at a rate of two extents per transaction because
+of reservation size limitations. Hence a rolling extent removal transaction
+keeps relogging the inode and btree buffers as they get modified in each
+removal operation. This keeps them moving forward in the log as the operation
+progresses, ensuring that the current operation never gets blocked by itself if
+the log wraps around.
+
+Hence it can be seen that the relogging operation is fundamental to the
+correct working of the XFS journalling subsystem. From the above description,
+most people should be able to see why XFS metadata operations write so
+much to the log - repeated operations to the same objects write the same
+changes to the log over and over again. Worse is the fact that objects tend to
+get dirtier as they get relogged, so each subsequent transaction is writing
+more metadata into the log.
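A toy model in C makes the cost of relogging visible. Each change to an object sets one bit, and every transaction that touches the object writes a copy of every change still held only in the log. The struct and functions are hypothetical and count changes rather than bytes.

```c
#include <stdint.h>

/* Hypothetical object whose pending changes are tracked as a bitmask. */
struct logged_object {
	uint32_t dirty;		/* changes held only in the log */
	unsigned logged_units;	/* total change copies written to the log */
};

/* Count the set bits in v. */
static unsigned popcount32(uint32_t v)
{
	unsigned n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Record change 'bit' and relog the aggregate of all pending changes. */
static void relog_change(struct logged_object *obj, unsigned bit)
{
	obj->dirty |= 1u << bit;
	obj->logged_units += popcount32(obj->dirty);
}

/* Once the object is on disk, the log no longer carries earlier changes. */
static void write_back(struct logged_object *obj)
{
	obj->dirty = 0;
}
```

Replaying the transaction table above with this model, changes A through D cost 1 + 2 + 3 + 4 = 10 units of log space for only four distinct changes. After the object is written back, E and F start a fresh aggregate and add 1 + 2 more.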
+
+Another feature of the XFS transaction subsystem is that most transactions are
+asynchronous. That is, they don’t commit to disk until either a log buffer is
+filled (a log buffer can hold multiple transactions) or a synchronous
+operation forces the log buffers holding the transactions to disk. This means
+that XFS is doing aggregation of transactions in memory - batching them, if
+you like - to minimise the impact of the log IO on transaction throughput.
+
+The limitation on asynchronous transaction throughput 

[PATCH 05/22] docs: add XFS shared data block chapter to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |1 
 .../filesystems/xfs-data-structures/reflink.rst|   43 
 2 files changed, 44 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/reflink.rst


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst 
b/Documentation/filesystems/xfs-data-structures/overview.rst
index 457e81c0eb40..d8d668ec6097 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -45,3 +45,4 @@ latency.
 
 .. include:: self_describing_metadata.rst
 .. include:: delayed_logging.rst
+.. include:: reflink.rst
diff --git a/Documentation/filesystems/xfs-data-structures/reflink.rst 
b/Documentation/filesystems/xfs-data-structures/reflink.rst
new file mode 100644
index ..653b3def7e6e
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/reflink.rst
@@ -0,0 +1,43 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Sharing Data Blocks
+-------------------
+
+On a traditional filesystem, there is a 1:1 mapping between a logical block
+offset in a file and a physical block on disk, which is to say that physical
+blocks are not shared. However, there exist various use cases for being able
+to share blocks between files — deduplicating files saves space on archival
+systems; creating space-efficient clones of disk images for virtual machines
+and containers facilitates efficient datacenters; and deferring the payment of
+the allocation cost of a file system tree copy as long as possible makes
+regular work faster. In all of these cases, a write to one of the shared
+copies **must** not affect the other shared copies, which means that writes to
+shared blocks must employ a copy-on-write strategy. Sharing blocks in this
+manner is commonly referred to as "reflinking".
+
+XFS implements block sharing in a fairly straightforward manner. All existing
+data fork structures remain unchanged, save for the addition of a
+per-allocation group `reference count B+tree <#reference-count-b-tree>`__. This
+data structure tracks reference counts for all shared physical blocks, with a
+few rules to maintain compatibility with existing code: If a block is free, it
+will be tracked in the free space B+trees. If a block is owned by a single
+file, it appears in neither the free space nor the reference count B+trees. If
+a block is shared, it will appear in the reference count B+tree with a
+reference count >= 2. The first two cases are established precedent in XFS, so
+the third case is the only behavioral change.
+
+When a filesystem block is shared, the block mapping in the destination file
+is updated to point to that filesystem block and the reference count B+tree
+records are updated to reflect the increased reference count. If a shared
+block is written, a new block will be allocated, the dirty data written to
+this new block, and the file’s block mapping updated to point to the new
+block. If a shared block is unmapped, the reference count records are updated
+to reflect the decreased reference count and the block is also freed if its
+reference count becomes zero. This enables users to create space-efficient
+clones of disk images and to copy filesystem subtrees quickly, using the
+standard Linux coreutils packages.
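The tracking rules above can be sketched with one reference count per physical block. This is a simplification: the C types and functions below are hypothetical, and the real reference count B+tree stores records for extents rather than for individual blocks.

```c
#include <stdint.h>

/* Where a block with a given number of owners would be tracked. */
enum block_tracking {
	TRACKED_AS_FREE,	/* 0 owners: free space B+trees */
	TRACKED_NOWHERE,	/* 1 owner: neither B+tree */
	TRACKED_AS_SHARED,	/* 2+ owners: reference count B+tree */
};

static enum block_tracking tracking_for(uint32_t refs)
{
	if (refs == 0)
		return TRACKED_AS_FREE;
	if (refs == 1)
		return TRACKED_NOWHERE;
	return TRACKED_AS_SHARED;
}

/* Reflinking maps the block into one more file. */
static void share_block(uint32_t *refs)
{
	(*refs)++;
}

/* Unmapping drops one reference; returns 1 if the block became free. */
static int unmap_block(uint32_t *refs)
{
	(*refs)--;
	return *refs == 0;
}

/* A write to a shared block must go to a newly allocated block instead. */
static int write_needs_cow(uint32_t refs)
{
	return refs >= 2;
}
```

Note that dropping from two owners to one moves the block out of the reference count B+tree without freeing it, which matches the rule that singly owned blocks appear in neither tree.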
+
+Deduplication employs the same mechanism to share blocks and copy them at
+write time. However, the kernel confirms that the contents of both files are
+identical before updating the destination file’s mapping. This enables XFS to
+be used by userspace deduplication programs such as duperemove.



[PATCH 09/22] docs: add XFS btrees to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/btrees.rst |  197 
 .../filesystems/xfs-data-structures/globals.rst|2 
 2 files changed, 199 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/btrees.rst


diff --git a/Documentation/filesystems/xfs-data-structures/btrees.rst 
b/Documentation/filesystems/xfs-data-structures/btrees.rst
new file mode 100644
index ..e343f71b37f6
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/btrees.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Fixed Length Record B+trees
+---------------------------
+
+XFS uses b+trees to index all metadata records. This well-known data structure
+is used to provide efficient random and sequential access to metadata records
+while minimizing seek times. There are two btree formats: a short format for
+records pertaining to a single allocation group, since all block pointers in
+an AG are 32-bits in size; and a long format for records pertaining to a file,
+since file data can have 64-bit block offsets. Each b+tree block is either a
+leaf node containing records, or an internal node containing keys and pointers
+to other b+tree blocks. The tree consists of a root block which may point to
+some number of other blocks; blocks in the bottom level of the b+tree contain
+only records.
+
+Leaf blocks of both types of b+trees have the same general format: a header
+describing the data in the block, and an array of records. The specific header
+formats are given in the next two sections, and the record format is provided
+by the b+tree client itself. The generic b+tree code does not have any
+specific knowledge of the record format.
+
+::
+
+    +--------+------------+------------+
+    | header |   record   | records... |
+    +--------+------------+------------+
+
+Internal node blocks of both types of b+trees also have the same general
+format: a header describing the data in the block, an array of keys, and an
+array of pointers. Each pointer may be associated with one or two keys. The
+first key uniquely identifies the first record accessible via the leftmost
+path down the branch of the tree.
+
+If the records in a b+tree are indexed by an interval, then a range of keys
+can uniquely identify a single record. For example, if a record covers blocks
+12-16, then any one of the keys 12, 13, 14, 15, or 16 returns the same record.
+In this case, the key for the record describing "12-16" is 12. If none of the
+records overlap, we only need to store one key.
+
+This is the format of a standard b+tree node:
+
+::
+
+++-+-+-+-+
+| header |   key   | keys... |   ptr   | ptrs... |
+++-+-+-+-+
+
+If the b+tree records do not overlap, performing a b+tree lookup is simple.
+Start with the root. If it is a leaf block, perform a binary search of the
+records until we find the record with the largest key that is less than or
+equal to our search key. If the block is a node block, perform a binary search
+of the keys until we find the largest key less than or equal to our search
+key, then follow the corresponding pointer to the next block. Repeat until we
+find a record.
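A minimal C sketch of that search over one block's sorted keys follows, reading "lower key" as "the largest key less than or equal to the search key". ``lookup_le()`` is a hypothetical name; the generic b+tree code also supports other lookup modes, and only this one is shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the last key that is <= 'want', or -1 if every key
 * is larger.  'keys' must be sorted in ascending order.  In a node block
 * the caller would follow the pointer at this index; in a leaf block
 * this is the candidate record.
 */
static long lookup_le(const uint64_t *keys, size_t n, uint64_t want)
{
	size_t lo = 0, hi = n;	/* converges on the first key > want */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] <= want)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (long)lo - 1;
}
```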
+
+However, if b+tree records contain intervals and are allowed to overlap, the
+internal nodes of the b+tree become larger:
+
+::
+
+++-+--+-+-+-+-+
+| header | low key | high key | low key | high key... |   ptr   | ptrs... |
+++-+--+-+-+-+-+
+
+The low keys are exactly the same as the keys in the non-overlapping b+tree.
+High keys, however, are a little different. Recall that a record with a key
+consisting of an interval can be referenced by a number of keys. Since the low
+key of a record indexes the low end of that key range, the high key indexes
+the high end of the key range. Returning to the example above, the high key
+for the record describing "12-16" is 16. The high key recorded in a b+tree
+node is the largest of the high keys of all records accessible under the
+subtree rooted by the pointer. For a level 1 node, this is the largest high
+key in the pointed-to leaf node; for any other node, this is the largest of
+the high keys in the pointed-to node.
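The high keys make pruning possible: a subtree can be skipped when every record under it starts after the query range or ends before it. The C sketch below shows that test for a single node level. ``struct ival``, ``struct ival_ptr``, and ``count_overlaps()`` are hypothetical, the children are plain arrays standing in for leaf blocks, and a real implementation would search within each block rather than scanning it linearly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record covering the inclusive block range [start, end]. */
struct ival {
	uint64_t start;
	uint64_t end;
};

/* One pointer in a node of an overlapping b+tree, with its two keys. */
struct ival_ptr {
	uint64_t low_key;		/* lowest start of any record below */
	uint64_t high_key;		/* highest end of any record below */
	const struct ival *recs;	/* stand-in for the child leaf block */
	size_t nrecs;
};

/* Count records that overlap the inclusive query range [qlo, qhi]. */
static size_t count_overlaps(const struct ival_ptr *ptrs, size_t nptrs,
			     uint64_t qlo, uint64_t qhi)
{
	size_t hits = 0;

	for (size_t i = 0; i < nptrs; i++) {
		/* Everything below starts after qhi or ends before qlo. */
		if (ptrs[i].low_key > qhi || ptrs[i].high_key < qlo)
			continue;
		for (size_t j = 0; j < ptrs[i].nrecs; j++) {
			const struct ival *r = &ptrs[i].recs[j];

			if (r->start <= qhi && r->end >= qlo)
				hits++;
		}
	}
	return hits;
}
```

With the "12-16" record from the example above, a query for block 16 alone still finds it, because the subtree's high key is at least 16 even though its low key is 12.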
+
+Nodes and leaves use the same magic numbers.
+
+Short Format B+trees
+~~~~~~~~~~~~~~~~~~~~
+
+Each allocation group uses a "short format" B+tree to index various
+information about the allocation group. The structure is called short format
+because all block pointers are AG block numbers. The trees use the following
+header:
+
+.. code:: c
+
+    struct xfs_btree_sblock {
+            __be32 bb_magic;
+            __be16 bb_level;
+            __be16 bb_numrecs;
+            __be32 bb_leftsib;
+            __be32

[PATCH 07/22] docs: add XFS common types and magic numbers to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/common_types.rst   |   61 
 .../filesystems/xfs-data-structures/magic.rst  |  277 
 .../filesystems/xfs-data-structures/overview.rst   |2 
 3 files changed, 340 insertions(+)
 create mode 100644 
Documentation/filesystems/xfs-data-structures/common_types.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/magic.rst


diff --git a/Documentation/filesystems/xfs-data-structures/common_types.rst 
b/Documentation/filesystems/xfs-data-structures/common_types.rst
new file mode 100644
index ..63de847924c6
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/common_types.rst
@@ -0,0 +1,61 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Common XFS Types
+----------------
+
+All the following XFS types can be found in xfs\_types.h. NULL values are
+always -1 on disk (i.e. all bits for the value set to one).
+
+**xfs\_ino\_t**
+    Unsigned 64 bit absolute `inode number <#inode-numbers>`__.
+
+**xfs\_off\_t**
+    Signed 64 bit file offset.
+
+**xfs\_daddr\_t**
+    Signed 64 bit disk address (sectors).
+
+**xfs\_agnumber\_t**
+    Unsigned 32 bit `AG number <#allocation-groups>`__.
+
+**xfs\_agblock\_t**
+    Unsigned 32 bit AG relative block number.
+
+**xfs\_extlen\_t**
+    Unsigned 32 bit `extent <#data-extents>`__ length in blocks.
+
+**xfs\_extnum\_t**
+    Signed 32 bit number of extents in a data fork.
+
+**xfs\_aextnum\_t**
+    Signed 16 bit number of extents in an attribute fork.
+
+**xfs\_dablk\_t**
+    Unsigned 32 bit block number for `directories <#directories>`__ and
+    `extended attributes <#extended-attributes>`__.
+
+**xfs\_dahash\_t**
+    Unsigned 32 bit hash of a directory file name or extended attribute name.
+
+**xfs\_fsblock\_t**
+    Unsigned 64 bit filesystem block number combining `AG
+    number <#allocation-groups>`__ and block offset into the AG.
+
+**xfs\_rfsblock\_t**
+    Unsigned 64 bit raw filesystem block number.
+
+**xfs\_rtblock\_t**
+    Unsigned 64 bit extent number in the `real-time <#real-time-devices>`__
+    sub-volume.
+
+**xfs\_fileoff\_t**
+    Unsigned 64 bit block offset into a file.
+
+**xfs\_filblks\_t**
+    Unsigned 64 bit block count for a file.
+
+**uuid\_t**
+    16-byte universally unique identifier (UUID).
+
+**xfs\_fsize\_t**
+    Signed 64 bit byte size of a file.
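As an example of how these types relate, an xfs\_fsblock\_t packs the AG number above the AG-relative block number. The sketch below assumes the split point is the superblock's ``sb_agblklog`` field (log2 of the AG size, rounded up), which the caller passes in; the function and macro names are hypothetical. Because of the rounding, these block numbers are not necessarily contiguous from one AG to the next.

```c
#include <stdint.h>

/* NULL values on disk have every bit set. */
#define SKETCH_NULLFSBLOCK ((uint64_t)-1)

/* Upper bits of an fsblock number: the AG number. */
static uint32_t fsb_to_agno(uint64_t fsbno, unsigned agblklog)
{
	return (uint32_t)(fsbno >> agblklog);
}

/* Lower agblklog bits of an fsblock number: the block within the AG. */
static uint32_t fsb_to_agbno(uint64_t fsbno, unsigned agblklog)
{
	return (uint32_t)(fsbno & ((1ULL << agblklog) - 1));
}

/* Combine an AG number and AG block number back into an fsblock number. */
static uint64_t agb_to_fsb(uint32_t agno, uint32_t agbno, unsigned agblklog)
{
	return ((uint64_t)agno << agblklog) | agbno;
}
```

For instance, with ``agblklog`` set to 16, block 5 of AG 3 becomes (3 << 16) | 5 = 196613.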
diff --git a/Documentation/filesystems/xfs-data-structures/magic.rst 
b/Documentation/filesystems/xfs-data-structures/magic.rst
new file mode 100644
index ..f5e57581645d
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/magic.rst
@@ -0,0 +1,277 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Magic Numbers
+-------------
+
+These are the magic numbers that are known to XFS, along with links to the
+relevant chapters. Magic numbers tend to have consistent locations:
+
+-  32-bit magic numbers are always at offset zero in the block.
+
+-  16-bit magic numbers for the directory and attribute B+tree are at offset
+   eight.
+
+-  The quota magic number is at offset zero.
+
+-  The inode magic is at the beginning of each inode.
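Since 32-bit magic numbers sit at offset zero and are stored big-endian, identifying a block can be as simple as reading its first four bytes and looking them up. The C sketch below copies a handful of values from the table in this section. ``be32_at()``, ``magic_table``, and ``identify_block()`` are hypothetical, and a real tool would also check the 16-bit dabtree magic at offset eight and the two-byte inode and quota magic values.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value from the start of a byte buffer. */
static uint32_t be32_at(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* A few 32-bit magic numbers copied from the table in this section. */
static const struct {
	uint32_t magic;
	const char *name;
} magic_table[] = {
	{ 0x58465342, "XFS_SB_MAGIC" },		/* "XFSB" */
	{ 0x58414746, "XFS_AGF_MAGIC" },	/* "XAGF" */
	{ 0x58414749, "XFS_AGI_MAGIC" },	/* "XAGI" */
	{ 0x41423342, "XFS_ABTB_CRC_MAGIC" },	/* "AB3B" */
	{ 0x49414233, "XFS_IBT_CRC_MAGIC" },	/* "IAB3" */
};

/* Name a block by its offset-zero 32-bit magic, or NULL if unknown. */
static const char *identify_block(const uint8_t *buf)
{
	uint32_t magic = be32_at(buf);

	for (size_t i = 0; i < sizeof(magic_table) / sizeof(magic_table[0]); i++)
		if (magic_table[i].magic == magic)
			return magic_table[i].name;
	return NULL;
}
```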
+
+.. list-table::
+   :widths: 28 12 8 34
+   :header-rows: 1
+
+   * - Flag
+     - Hexadecimal
+     - ASCII
+     - Data structure
+   * - XFS_SB_MAGIC
+     - 0x58465342
+     - XFSB
+     - `Superblock <#superblocks>`__
+   * - XFS_AGF_MAGIC
+     - 0x58414746
+     - XAGF
+     - `Free Space <#ag-free-space-block>`__
+   * - XFS_AGI_MAGIC
+     - 0x58414749
+     - XAGI
+     - `Inode Information <#inode-information>`__
+   * - XFS_AGFL_MAGIC
+     - 0x5841464c
+     - XAFL
+     - `Free Space List <#ag-free-list>`__, v5 only
+   * - XFS_DINODE_MAGIC
+     - 0x494e
+     - IN
+     - `Inodes <#inode-core>`__
+   * - XFS_DQUOT_MAGIC
+     - 0x4451
+     - DQ
+     - `Quota Inodes <#quota-inodes>`__
+   * - XFS_SYMLINK_MAGIC
+     - 0x58534c4d
+     - XSLM
+     - `Symbolic Links <#extent-symbolic-links>`__
+   * - XFS_ABTB_MAGIC
+     - 0x41425442
+     - ABTB
+     - `Free Space by Block B+tree <#ag-free-space-b-trees>`__
+   * - XFS_ABTB_CRC_MAGIC
+     - 0x41423342
+     - AB3B
+     - `Free Space by Block B+tree <#ag-free-space-b-trees>`__, v5 only
+   * - XFS_ABTC_MAGIC
+     - 0x41425443
+     - ABTC
+     - `Free Space by Size B+tree <#ag-free-space-b-trees>`__
+   * - XFS_ABTC_CRC_MAGIC
+     - 0x41423343
+     - AB3C
+     - `Free Space by Size B+tree <#ag-free-space-b-trees>`__, v5 only
+   * - XFS_IBT_MAGIC
+     - 0x49414254
+     - IABT
+     - `Inode B+tree <#inode-b-trees>`__
+   * - XFS_IBT_CRC_MAGIC
+     - 0x49414233
+     - IAB3
+     - `Inode B+tree <#inode-b-trees>`__, v5 only
+   * - XFS_FIBT_MAGIC
+     - 0x46494254
+     - FIBT
+     - `Free Inode B+tree <#inode-b-trees>`__
+   * - XFS_FIBT_CRC_MAGIC
+     - 0x46494233
+     - FIB3

[PATCH v2 00/22] xfs-4.20: major documentation surgery

2018-10-03 Thread Darrick J. Wong
Hi all,

This series converts the existing in-kernel xfs documentation to rst
format, links it in with the rest of the kernel's rst documentation, and
then begins pulling in the contents of the Data Structures & Algorithms
book from the xfs-documentation git tree.  No changes are made to the
text during the import process except to fix things that the conversion
process (asciidoctor + pandoc) didn't do correctly.  The goal of this
series is to tie together the XFS code with the on-disk format
documentation for the features supported by the code.

I've built the docs and put them here, in case you hate reading rst:
https://djwong.org/docs/kdoc/admin-guide/xfs.html
https://djwong.org/docs/kdoc/filesystems/xfs-data-structures/index.html

I've posted a branch here because the png import patch is huge:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=docs-4.20-merge

The patchset should apply cleanly against 4.19-rc6.  Comments and
questions are, as always, welcome.

--D


[PATCH 01/22] docs: add skeleton of XFS Data Structures and Algorithms book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Start adding the main TOC of the XFS data structures and algorithms
book.  We'll add the individual sections in later patches.

Signed-off-by: Darrick J. Wong 
---
 Documentation/conf.py  |2 
 .../filesystems/xfs-data-structures/about.rst  |  123 
 .../filesystems/xfs-data-structures/auxiliary.rst  |4 +
 .../filesystems/xfs-data-structures/dynamic.rst|4 +
 .../filesystems/xfs-data-structures/globals.rst|4 +
 .../filesystems/xfs-data-structures/index.rst  |   15 ++
 .../filesystems/xfs-data-structures/overview.rst   |   44 +++
 Documentation/index.rst|1 
 8 files changed, 197 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/about.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/auxiliary.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/dynamic.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/globals.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/index.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/overview.rst


diff --git a/Documentation/conf.py b/Documentation/conf.py
index add6788bbb8c..fbf8f5dce7d9 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -383,6 +383,8 @@ latex_documents = [
  'The kernel development community', 'manual'),
 ('admin-guide/xfs', 'xfs-admin-guide.tex',
  'XFS Administration Guide', 'XFS Community', 'manual'),
+('filesystems/xfs-data-structures/index', 'xfs-data-structures.tex',
+ 'XFS Data Structures and Algorithms', 'XFS Community', 'manual'),
 ('filesystems/index', 'filesystems.tex', 'Linux Filesystems API',
  'The kernel development community', 'manual'),
 ('filesystems/ext4/index', 'ext4.tex', 'ext4 Filesystem',
diff --git a/Documentation/filesystems/xfs-data-structures/about.rst b/Documentation/filesystems/xfs-data-structures/about.rst
new file mode 100644
index ..7df40b637e2e
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/about.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+About this Book
+===============
+
+XFS is a high-performance filesystem which was designed to maximize
+parallel throughput and to scale up to extremely large 64-bit storage
+systems. Originally developed by SGI in October 1993 for IRIX, XFS can
+handle large files, large filesystems, many inodes, large directories,
+large file attributes, and large allocations. Filesystems are optimized
+for parallel access by splitting the storage device into semi-autonomous
+allocation groups. XFS employs branching trees (B+ trees) to facilitate
+fast searches of large lists; it also uses delayed extent-based
+allocation to improve data contiguity and IO performance.
+
+This document describes the on-disk layout of an XFS filesystem and how
+to use the debugging tools ``xfs_db`` and ``xfs_logprint`` to inspect
+the metadata structures. It also describes how on-disk metadata relates
+to the higher level design goals.
+
+This book’s source code is available in the Linux kernel git tree.
+Feedback should be sent to the XFS mailing list, currently at:
+``linux-...@vger.kernel.org``.
+
+**Note**
+
+All fields in XFS metadata structures are in big-endian byte order
+except for log items, which are formatted in host order.
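Because on-disk fields are big-endian, code that parses a raw metadata block has to convert them on little-endian hosts. The following is a minimal, portable sketch rather than the kernel's own code (which uses helpers such as ``be32_to_cpu``); the ``get_be*`` names are hypothetical. Assembling each value byte by byte gives the same result regardless of host byte order:

```c
#include <stdint.h>

/* Hypothetical helpers: decode big-endian on-disk fields from a raw byte
 * buffer.  Building the value byte by byte is host-endian independent. */
static inline uint16_t get_be16(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static inline uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static inline uint64_t get_be64(const unsigned char *p)
{
	/* High 32 bits come first on disk. */
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}
```

For example, the bytes ``58 46 53 42`` at the start of a superblock decode to ``0x58465342`` (``XFSB``). Log items are in host order, so they must not be run through these helpers.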
+
+Copyright
+---------
+© Copyright 2006 Silicon Graphics Inc. All rights reserved.  Permission is
+granted to copy, distribute, and/or modify this document under the terms of the
+Creative Commons Attribution-Share Alike, Version 3.0 or any later version
+published by the Creative Commons Corp. A copy of the license is available at
+http://creativecommons.org/licenses/by-sa/3.0/us/ .
+
+Change Log
+----------
+
+.. list-table::
+   :widths: 8 12 14 46
+   :header-rows: 1
+
+   * - Version
+     - Date
+     - Author
+     - Description
+
+   * - 0.1
+     - 2006
+     - Silicon Graphics, Inc.
+     - Initial Release
+
+   * - 1.0
+     - Fri Jul 03 2009
+     - Ryan Lerch
+     - Publican Conversion
+
+   * - 1.1
+     - March 2010
+     - Eric Sandeen
+     - Community Release
+
+   * - 1.99
+     - February 2014
+     - Dave Chinner
+     - AsciiDoc Conversion
+
+   * - 3.0
+     - October 2015
+     - Darrick J. Wong
+     - Miscellaneous fixes.
+       Add missing field definitions.
+       Add some missing xfs_db examples.
+       Add an overview of XFS.
+       Document the journal format.
+       Document the realtime device.
+
+   * - 3.1
+     - October 2015
+     - Darrick J. Wong
+     - Add v5 fields.
+       Discuss metadata integrity.
+       Document the free inode B+tree.
+       Create an index of magic numbers.
+       Document sparse inodes.
+
+   * - 3.14
+     - January 2016
+     - Darrick J. Wong
+     - Document disk format change testing.
+
+   * - 3.141
+     - June 2016
+     - Darrick J. Wong

[PATCH 07/22] docs: add XFS common types and magic numbers to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/common_types.rst   |   61 
 .../filesystems/xfs-data-structures/magic.rst  |  277 
 .../filesystems/xfs-data-structures/overview.rst   |2 
 3 files changed, 340 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/common_types.rst
 create mode 100644 Documentation/filesystems/xfs-data-structures/magic.rst


diff --git a/Documentation/filesystems/xfs-data-structures/common_types.rst b/Documentation/filesystems/xfs-data-structures/common_types.rst
new file mode 100644
index ..63de847924c6
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/common_types.rst
@@ -0,0 +1,61 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Common XFS Types
+----------------
+
+All of the following XFS types are defined in xfs\_types.h. NULL values are
+always -1 on disk (i.e. all bits of the value are set to one).
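This short sketch shows what "all bits set" means in practice. The typedef widths follow the list below. The ``EX_``-prefixed constant names are hypothetical and are not the definitions in xfs\_types.h:

```c
#include <stdbool.h>
#include <stdint.h>

/* Widths as listed in this section (sketch, not the real header). */
typedef uint64_t xfs_fsblock_t;
typedef uint32_t xfs_agblock_t;
typedef uint32_t xfs_agnumber_t;

/* NULL is -1 converted to the type: every bit of the value is set. */
#define EX_NULLFSBLOCK	((xfs_fsblock_t)-1)
#define EX_NULLAGBLOCK	((xfs_agblock_t)-1)
#define EX_NULLAGNUMBER	((xfs_agnumber_t)-1)

/* True if an AG block number field holds the NULL sentinel. */
static bool ex_agblock_is_null(xfs_agblock_t bno)
{
	return bno == EX_NULLAGBLOCK;
}
```

Converting -1 to an unsigned type yields that type's maximum value, so a single comparison detects the sentinel.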
+
+**xfs\_ino\_t**
+Unsigned 64 bit absolute `inode number <#inode-numbers>`__.
+
+**xfs\_off\_t**
+Signed 64 bit file offset.
+
+**xfs\_daddr\_t**
+Signed 64 bit disk address (sectors).
+
+**xfs\_agnumber\_t**
+Unsigned 32 bit `AG number <#allocation-groups>`__.
+
+**xfs\_agblock\_t**
+Unsigned 32 bit AG relative block number.
+
+**xfs\_extlen\_t**
+Unsigned 32 bit `extent <#data-extents>`__ length in blocks.
+
+**xfs\_extnum\_t**
+Signed 32 bit number of extents in a data fork.
+
+**xfs\_aextnum\_t**
+Signed 16 bit number of extents in an attribute fork.
+
+**xfs\_dablk\_t**
+Unsigned 32 bit block number for `directories <#directories>`__ and
+`extended attributes <#extended-attributes>`__.
+
+**xfs\_dahash\_t**
+Unsigned 32 bit hash of a directory file name or extended attribute name.
+
+**xfs\_fsblock\_t**
+Unsigned 64 bit filesystem block number combining `AG
+number <#allocation-groups>`__ and block offset into the AG.
+
+**xfs\_rfsblock\_t**
+Unsigned 64 bit raw filesystem block number.
+
+**xfs\_rtblock\_t**
+Unsigned 64 bit extent number in the `real-time <#real-time-devices>`__
+sub-volume.
+
+**xfs\_fileoff\_t**
+Unsigned 64 bit block offset into a file.
+
+**xfs\_filblks\_t**
+Unsigned 64 bit block count for a file.
+
+**uuid\_t**
+16-byte universally unique identifier (UUID).
+
+**xfs\_fsize\_t**
+Signed 64 bit byte size of a file.
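You can picture an ``xfs_fsblock_t`` as the AG number shifted into the high bits, with the AG-relative block number in the low bits. The sketch below assumes the shift is the number of bits needed to hold the AG size in blocks, rounded up. The kernel keeps this value in the superblock field ``sb_agblklog``. The helper names are hypothetical:

```c
#include <stdint.h>

typedef uint64_t xfs_fsblock_t;
typedef uint32_t xfs_agnumber_t;
typedef uint32_t xfs_agblock_t;

/* Combine an AG number and an AG-relative block into an fsblock. */
static xfs_fsblock_t agb_to_fsb(xfs_agnumber_t agno, xfs_agblock_t agbno,
				unsigned int agblklog)
{
	return ((xfs_fsblock_t)agno << agblklog) | agbno;
}

/* Recover the AG number from the high bits ... */
static xfs_agnumber_t fsb_to_agno(xfs_fsblock_t fsbno, unsigned int agblklog)
{
	return (xfs_agnumber_t)(fsbno >> agblklog);
}

/* ... and the block offset within that AG from the low bits. */
static xfs_agblock_t fsb_to_agbno(xfs_fsblock_t fsbno, unsigned int agblklog)
{
	return (xfs_agblock_t)(fsbno & (((xfs_fsblock_t)1 << agblklog) - 1));
}
```

The shift is rounded up to a whole number of bits. So when the AG size is not a power of two, an ``xfs_fsblock_t`` is not a plain linear block count, unlike the raw ``xfs_rfsblock_t``.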
diff --git a/Documentation/filesystems/xfs-data-structures/magic.rst b/Documentation/filesystems/xfs-data-structures/magic.rst
new file mode 100644
index ..f5e57581645d
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/magic.rst
@@ -0,0 +1,277 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Magic Numbers
+-------------
+
+These are the magic numbers that are known to XFS, along with links to the
+relevant chapters. Magic numbers tend to have consistent locations:
+
+-  32-bit magic numbers are always at offset zero in the block.
+
+-  16-bit magic numbers for the directory and attribute B+trees are at offset
+   eight.
+
+-  The quota magic number is at offset zero.
+
+-  The inode magic is at the beginning of each inode.
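Every 32-bit magic number sits at offset zero. So a block of unknown type can be classified by reading its first four bytes as a big-endian integer. The sketch below covers a few entries from the table in this section. The ``classify_block`` helper and the choice of table subset are illustrative only:

```c
#include <stddef.h>
#include <stdint.h>

/* A few 32-bit magics from the table; each is just its four ASCII
 * characters read as a big-endian integer. */
struct magic_entry {
	uint32_t magic;
	const char *name;
};

static const struct magic_entry magics32[] = {
	{ 0x58465342, "XFS_SB_MAGIC" },		/* XFSB */
	{ 0x58414746, "XFS_AGF_MAGIC" },	/* XAGF */
	{ 0x58414749, "XFS_AGI_MAGIC" },	/* XAGI */
	{ 0x41423342, "XFS_ABTB_CRC_MAGIC" },	/* AB3B */
	{ 0x49414233, "XFS_IBT_CRC_MAGIC" },	/* IAB3 */
};

/* Classify a block by the 32-bit magic at offset zero; NULL if unknown. */
static const char *classify_block(const unsigned char *blk)
{
	uint32_t m = ((uint32_t)blk[0] << 24) | ((uint32_t)blk[1] << 16) |
		     ((uint32_t)blk[2] << 8) | (uint32_t)blk[3];

	for (size_t i = 0; i < sizeof(magics32) / sizeof(magics32[0]); i++)
		if (magics32[i].magic == m)
			return magics32[i].name;
	return NULL;
}
```

This handles only the 32-bit magics. The 16-bit directory and attribute B+tree magics at offset eight, and the 16-bit inode magic, would need their own checks.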
+
+.. list-table::
+   :widths: 28 12 8 34
+   :header-rows: 1
+
+   * - Name
+     - Hexadecimal
+     - ASCII
+     - Data structure
+   * - XFS_SB_MAGIC
+     - 0x58465342
+     - XFSB
+     - `Superblock <#superblocks>`__
+   * - XFS_AGF_MAGIC
+     - 0x58414746
+     - XAGF
+     - `Free Space <#ag-free-space-block>`__
+   * - XFS_AGI_MAGIC
+     - 0x58414749
+     - XAGI
+     - `Inode Information <#inode-information>`__
+   * - XFS_AGFL_MAGIC
+     - 0x5841464c
+     - XAFL
+     - `Free Space List <#ag-free-list>`__, v5 only
+   * - XFS_DINODE_MAGIC
+     - 0x494e
+     - IN
+     - `Inodes <#inode-core>`__
+   * - XFS_DQUOT_MAGIC
+     - 0x4451
+     - DQ
+     - `Quota Inodes <#quota-inodes>`__
+   * - XFS_SYMLINK_MAGIC
+     - 0x58534c4d
+     - XSLM
+     - `Symbolic Links <#extent-symbolic-links>`__
+   * - XFS_ABTB_MAGIC
+     - 0x41425442
+     - ABTB
+     - `Free Space by Block B+tree <#ag-free-space-b-trees>`__
+   * - XFS_ABTB_CRC_MAGIC
+     - 0x41423342
+     - AB3B
+     - `Free Space by Block B+tree <#ag-free-space-b-trees>`__, v5 only
+   * - XFS_ABTC_MAGIC
+     - 0x41425443
+     - ABTC
+     - `Free Space by Size B+tree <#ag-free-space-b-trees>`__
+   * - XFS_ABTC_CRC_MAGIC
+     - 0x41423343
+     - AB3C
+     - `Free Space by Size B+tree <#ag-free-space-b-trees>`__, v5 only
+   * - XFS_IBT_MAGIC
+     - 0x49414254
+     - IABT
+     - `Inode B+tree <#inode-b-trees>`__
+   * - XFS_IBT_CRC_MAGIC
+     - 0x49414233
+     - IAB3
+     - `Inode B+tree <#inode-b-trees>`__, v5 only
+   * - XFS_FIBT_MAGIC
+     - 0x46494254
+     - FIBT
+     - `Free Inode B+tree <#inode-b-trees>`__
+   * - XFS_FIBT_CRC_MAGIC
+     - 0x46494233
+     - FIB3

[PATCH 08/22] docs: add XFS testing chapter to the DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |1 +
 .../filesystems/xfs-data-structures/testing.rst|   25 
 2 files changed, 26 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/testing.rst


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst b/Documentation/filesystems/xfs-data-structures/overview.rst
index 23eb71d65c93..d6711dc653d8 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -49,3 +49,4 @@ latency.
 .. include:: reconstruction.rst
 .. include:: common_types.rst
 .. include:: magic.rst
+.. include:: testing.rst
diff --git a/Documentation/filesystems/xfs-data-structures/testing.rst b/Documentation/filesystems/xfs-data-structures/testing.rst
new file mode 100644
index ..3d3386854408
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/testing.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Testing Filesystem Changes
+--------------------------
+
+People put a lot of trust in filesystems to preserve their data in a reliable
+fashion. To that end, it is very important that users and developers have
+access to a suite of regression tests that can be used to demonstrate correct
+operation of any given filesystem code, or to analyze failures when fixing
+problems found in the code. The XFS regression test suite, xfstests, is hosted at
+``git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git``. Most tests apply to
+filesystems in general, but the suite also contains tests for features
+specific to each filesystem.
+
+When fixing bugs, it is important to provide a test case exposing the bug so
+that the developers can avoid a future recurrence of the regression.
+Furthermore, if you’re developing a new user-visible feature for XFS, please
+help the rest of the development community to sustain and maintain the whole
+codebase by providing generous test coverage to check its behavior.
+
+When altering, adding, or removing an on-disk data structure, please remember
+both to update the in-kernel structure size checks in xfs\_ondisk.h and to
+ensure that your changes are reflected in the xfstests test xfs/122. These
+regression tests enable us to detect compiler bugs, alignment problems, and
+anything else that might result in the creation of incompatible filesystem
+images.
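A compile-time size check of the kind described above can be written with C11 ``_Static_assert``. This sketches the idea only; it is not the macros that xfs\_ondisk.h actually uses. The record is a hypothetical two-field structure modelled on a free-space B+tree record:

```c
#include <stdint.h>

/* Hypothetical on-disk record: two 32-bit fields, stored big-endian on
 * disk and converted on access. */
struct example_alloc_rec {
	uint32_t ar_startblock;
	uint32_t ar_blockcount;
};

/* Fail the build if a compiler or ABI change alters the on-disk size. */
#define EXAMPLE_CHECK_STRUCT_SIZE(type, size) \
	_Static_assert(sizeof(type) == (size), \
		       "on-disk size of " #type " changed")

EXAMPLE_CHECK_STRUCT_SIZE(struct example_alloc_rec, 8);
```

Checks like this catch alignment surprises. For example, a structure holding a 32-bit field followed by a 64-bit field is typically 16 bytes on x86-64 but 12 bytes on 32-bit x86, because the two ABIs align 64-bit integers differently inside structures.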



[PATCH 04/22] docs: add XFS delayed logging design doc to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../xfs-data-structures/delayed_logging.rst|  828 
 .../filesystems/xfs-data-structures/overview.rst   |1 
 .../filesystems/xfs-delayed-logging-design.txt |  793 ---
 3 files changed, 829 insertions(+), 793 deletions(-)
 create mode 100644 Documentation/filesystems/xfs-data-structures/delayed_logging.rst
 delete mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt


diff --git a/Documentation/filesystems/xfs-data-structures/delayed_logging.rst b/Documentation/filesystems/xfs-data-structures/delayed_logging.rst
new file mode 100644
index ..a4ae343e7556
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/delayed_logging.rst
@@ -0,0 +1,828 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Delayed Logging
+---------------
+
+Introduction to Re-logging in XFS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+XFS logging is a combination of logical and physical logging. Some objects,
+such as inodes and dquots, are logged in logical format where the details
+logged are made up of the changes to in-core structures rather than on-disk
+structures. Other objects, typically buffers, have their physical changes
+logged. The reason for these differences is to reduce the amount of log space
+required for objects that are frequently logged. Some parts of inodes are more
+frequently logged than others, and inodes are typically more frequently logged
+than any other object (except perhaps the superblock buffer), so keeping the
+amount of metadata logged low is of prime importance.
+
+The reason that this is such a concern is that XFS allows multiple separate
+modifications to a single object to be carried in the log at any given time.
+This allows the log to avoid needing to flush each change to disk before
+recording a new change to the object. XFS does this via a method called
+"re-logging". Conceptually, this is quite simple: all it requires is that any
+new change to the object is recorded with a **new copy** of all the existing
+changes in the new transaction that is written to the log.
+
+That is, if we have a sequence of changes A through F, and the object was
+written to disk after change D, we would see in the log the following series
+of transactions, their contents and the log sequence number (LSN) of the
+transaction:
+
+::
+
+         Transaction     Contents      LSN
+              A          A             X
+              B          A+B           X+n
+              C          A+B+C         X+n+m
+              D          A+B+C+D       X+n+m+o
+               <object written to disk>
+              E          E             Y (> X+n+m+o)
+              F          E+F           Y+p
+
+In other words, each time an object is relogged, the new transaction contains
+the aggregation of all the previous changes currently held only in the log.
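The relogging pattern in the table can be reproduced with a small simulation. This sketch rests on simplifying assumptions and is not XFS code. Each change is a single letter. Every transaction carries all of the object's changes not yet written back. Writing the object to disk clears that set. The structure and function names are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Changes that exist only in the log (not yet written back to disk). */
struct relog_object {
	char pending[32];	/* NUL-terminated list of change names */
	size_t npending;
};

/*
 * Record a new change and return what the next transaction must carry:
 * every change still held only in the log, plus the new one.  The fixed
 * capacity keeps the sketch short; changes beyond it are ignored.
 */
static const char *relog_change(struct relog_object *obj, char change)
{
	if (obj->npending + 1 < sizeof(obj->pending)) {
		obj->pending[obj->npending++] = change;
		obj->pending[obj->npending] = '\0';
	}
	return obj->pending;
}

/* Once the object is on disk, earlier changes need not be relogged. */
static void object_written_back(struct relog_object *obj)
{
	memset(obj->pending, 0, sizeof(obj->pending));
	obj->npending = 0;
}
```

Feed in changes A through D, call ``object_written_back()``, then feed in E and F. The successive transactions are ``A``, ``AB``, ``ABC``, ``ABCD``, ``E``, ``EF``, matching the Contents column of the table above.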
+
+This relogging technique also allows objects to be moved forward in the log so
+that an object being relogged does not prevent the tail of the log from ever
+moving forward. This can be seen in the table above by the changing
+(increasing) LSN of each subsequent transaction; the LSN is effectively a
+direct encoding of the location in the log of the transaction.
+
+This relogging is also used to implement long-running, multiple-commit
+transactions. These transactions are known as rolling transactions and require
+a special log reservation known as a permanent transaction reservation. A
+typical example of a rolling transaction is the removal of extents from an
+inode, which can only be done at a rate of two extents per transaction because
+of reservation size limitations. Hence a rolling extent removal transaction
+keeps relogging the inode and btree buffers as they get modified in each
+removal operation. This keeps them moving forward in the log as the operation
+progresses, ensuring that the current operation never gets blocked by itself if
+the log wraps around.
+
+Hence it can be seen that the relogging operation is fundamental to the
+correct working of the XFS journalling subsystem. From the above description,
+most people should be able to see why XFS metadata operations write so
+much to the log: repeated operations to the same objects write the same
+changes to the log over and over again. Worse is the fact that objects tend to
+get dirtier as they get relogged, so each subsequent transaction is writing
+more metadata into the log.
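The growth can be quantified under a deliberately simple, hypothetical model. Every change dirties the same number of new bytes, and the object is never written back in between, so transaction *i* carries *i* changes. Summing over *n* transactions gives a quadratic total:

```c
#include <stdint.h>

/*
 * Bytes written to the log by n relogged transactions when each change
 * dirties bytes_per_change new bytes and the object is never written back:
 * bytes_per_change * (1 + 2 + ... + n) = bytes_per_change * n(n+1)/2.
 */
static uint64_t relog_bytes(uint64_t n, uint64_t bytes_per_change)
{
	return bytes_per_change * n * (n + 1) / 2;
}
```

With four changes of 128 bytes each, ``relog_bytes(4, 128)`` is 1280 bytes of log traffic for 512 bytes of actual modification.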
+
+Another feature of the XFS transaction subsystem is that most transactions are
+asynchronous. That is, they don’t commit to disk until either a log buffer is
+filled (a log buffer can hold multiple transactions) or a synchronous
+operation forces the log buffers holding the transactions to disk. This means
+that XFS aggregates transactions in memory (batching them, if you like) to
+minimise the impact of the log IO on transaction throughput.
+
+The limitation on asynchronous transaction throughput 

[PATCH 06/22] docs: add XFS online repair chapter to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |1 
 .../xfs-data-structures/reconstruction.rst |   68 
 2 files changed, 69 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/reconstruction.rst


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst b/Documentation/filesystems/xfs-data-structures/overview.rst
index d8d668ec6097..b1b3f711638b 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -46,3 +46,4 @@ latency.
 .. include:: self_describing_metadata.rst
 .. include:: delayed_logging.rst
 .. include:: reflink.rst
+.. include:: reconstruction.rst
diff --git a/Documentation/filesystems/xfs-data-structures/reconstruction.rst b/Documentation/filesystems/xfs-data-structures/reconstruction.rst
new file mode 100644
index ..10a7a728c50c
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/reconstruction.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Metadata Reconstruction
+-----------------------
+
+**Note**
+
+This is a theoretical discussion of how reconstruction could work; none of
+this is implemented as of 2018.
+
+A simple UNIX filesystem can be thought of in terms of a directed acyclic
+graph. To a first approximation, there exists a root directory node, which
+points to other nodes. Those other nodes can themselves be directories or they
+can be files. Each file, in turn, points to data blocks.
+
+XFS adds a few more details to this picture:
+
+-  The real root(s) of an XFS filesystem are the allocation group headers
+   (superblock, AGF, AGI, AGFL).
+
+-  Each allocation group’s headers point to various per-AG B+trees (free
+   space, inode, free inodes, free list, etc.)
+
+-  The free space B+trees point to unused extents;
+
+-  The inode B+trees point to blocks containing inode chunks;
+
+-  All superblocks point to the root directory and the log;
+
+-  Hardlinks mean that multiple directories can point to a single file node;
+
+-  File data block pointers are indexed by file offset;
+
+-  Files and directories can have a second collection of pointers to data
+   blocks which contain extended attributes;
+
+-  Large directories require multiple data blocks to store all the
+   subpointers;
+
+-  Still larger directories use high-offset data blocks to store a B+tree of
+   hashes to directory entries;
+
+-  Large extended attribute forks similarly use high-offset data blocks to
+   store a B+tree of hashes to attribute keys; and
+
+-  Symbolic links can point to data blocks.
+
+The beauty of this massive graph structure is that under normal circumstances,
+everything known to the filesystem is discoverable (access controls
+notwithstanding) from the root. The major weakness of this structure, of
+course, is that breaking an edge in the graph can render entire subtrees
+inaccessible.
+xfs\_repair “recovers” from broken directories by scanning for unlinked inodes
+and connecting them to /lost+found, but this isn’t sufficiently general to
+recover from breaks in other parts of the graph structure. Wouldn’t it be
+useful to have back pointers as a secondary data structure? The current repair
+strategy is to reconstruct whatever can be rebuilt, but to scrap anything that
+doesn’t check out.
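Here is a toy model of that weakness. It uses a directed graph rooted at node 0, walks it depth-first from the root, and counts the nodes that become unreachable once one edge is cut. The graph representation and helper names are hypothetical and unrelated to XFS's on-disk structures:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* Toy filesystem graph: edge[a][b] means node a points to node b. */
struct fs_graph {
	size_t nnodes;
	bool edge[MAX_NODES][MAX_NODES];
};

/* Depth-first search marking every node reachable from `node`. */
static void mark_reachable(const struct fs_graph *g, size_t node, bool *seen)
{
	if (seen[node])
		return;
	seen[node] = true;
	for (size_t next = 0; next < g->nnodes; next++)
		if (g->edge[node][next])
			mark_reachable(g, next, seen);
}

/* Count nodes unreachable from the root (node 0): the ones a repair
 * tool would have to rediscover by scanning. */
static size_t count_orphans(const struct fs_graph *g)
{
	bool seen[MAX_NODES] = { false };
	size_t orphans = 0;

	mark_reachable(g, 0, seen);
	for (size_t i = 0; i < g->nnodes; i++)
		if (!seen[i])
			orphans++;
	return orphans;
}
```

Cutting the single edge from the root to a directory orphans that directory and everything beneath it. Back pointers would help repair exactly this situation.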
+
+The `reverse-mapping B+tree <#reverse-mapping-b-tree>`__ fills in part of the
+puzzle. Since it contains copies of every entry in each inode’s data and
+attribute forks, we can fix a corrupted block map with these records.
+Furthermore, if the inode B+trees become corrupt, it is possible to visit all
+inode chunks using the reverse-mapping data. Should XFS ever gain the ability
+to store parent directory information in each inode, it also becomes possible
+to resurrect damaged directory trees, which should reduce the complaints about
+inodes ending up in /lost+found. Everything else in the per-AG primary
+metadata can already be reconstructed via xfs\_repair. Hopefully,
+reconstruction will not turn out to be a fool’s errand.
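The sketch below rebuilds one inode's data fork mapping from reverse-mapping records. It assumes a single AG and a simplified record that ignores the flags real rmap records carry (attribute fork, B+tree block, unwritten extents). The structures and the ``rebuild_bmap`` helper are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified reverse-mapping record. */
struct rmap_rec {
	uint32_t startblock;	/* AG-relative physical block */
	uint32_t blockcount;	/* length in blocks */
	uint64_t owner;		/* inode number */
	uint64_t offset;	/* logical block offset in the file */
};

/* Simplified data fork extent. */
struct bmap_extent {
	uint64_t fileoff;
	uint32_t startblock;
	uint32_t blockcount;
};

/* Collect up to `max` extents owned by `ino`, sorted by file offset. */
static size_t rebuild_bmap(const struct rmap_rec *rmaps, size_t nrmaps,
			   uint64_t ino, struct bmap_extent *out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < nrmaps && n < max; i++) {
		if (rmaps[i].owner != ino)
			continue;
		out[n].fileoff = rmaps[i].offset;
		out[n].startblock = rmaps[i].startblock;
		out[n].blockcount = rmaps[i].blockcount;
		n++;
	}

	/* Insertion sort: rmap records are ordered by physical block, but a
	 * block map must be ordered by logical file offset. */
	for (size_t i = 1; i < n; i++) {
		struct bmap_extent tmp = out[i];
		size_t j = i;

		while (j > 0 && out[j - 1].fileoff > tmp.fileoff) {
			out[j] = out[j - 1];
			j--;
		}
		out[j] = tmp;
	}
	return n;
}
```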



[PATCH 05/22] docs: add XFS shared data block chapter to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |1 
 .../filesystems/xfs-data-structures/reflink.rst|   43 
 2 files changed, 44 insertions(+)
 create mode 100644 Documentation/filesystems/xfs-data-structures/reflink.rst


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst b/Documentation/filesystems/xfs-data-structures/overview.rst
index 457e81c0eb40..d8d668ec6097 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -45,3 +45,4 @@ latency.
 
 .. include:: self_describing_metadata.rst
 .. include:: delayed_logging.rst
+.. include:: reflink.rst
diff --git a/Documentation/filesystems/xfs-data-structures/reflink.rst b/Documentation/filesystems/xfs-data-structures/reflink.rst
new file mode 100644
index ..653b3def7e6e
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/reflink.rst
@@ -0,0 +1,43 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Sharing Data Blocks
+-------------------
+
+On a traditional filesystem, there is a 1:1 mapping between a logical block
+offset in a file and a physical block on disk, which is to say that physical
+blocks are not shared. However, there exist various use cases for being able
+to share blocks between files: deduplicating files saves space on archival
+systems; creating space-efficient clones of disk images for virtual machines
+and containers facilitates efficient datacenters; and deferring the payment of
+the allocation cost of a filesystem tree copy as long as possible makes
+regular work faster. In all of these cases, a write to one of the shared
+copies **must** not affect the other shared copies, which means that writes to
+shared blocks must employ a copy-on-write strategy. Sharing blocks in this
+manner is commonly referred to as "reflinking".
+
+XFS implements block sharing in a fairly straightforward manner. All existing
+data fork structures remain unchanged, save for the addition of a
+per-allocation group `reference count B+tree <#reference-count-b-tree>`__. This
+data structure tracks reference counts for all shared physical blocks, with a
+few rules to maintain compatibility with existing code: If a block is free, it
+will be tracked in the free space B+trees. If a block is owned by a single
+file, it appears in neither the free space nor the reference count B+trees. If
+a block is shared, it will appear in the reference count B+tree with a
+reference count >= 2. The first two cases are established precedent in XFS, so
+the third case is the only behavioral change.
+
+When a filesystem block is shared, the block mapping in the destination file
+is updated to point to that filesystem block and the reference count B+tree
+records are updated to reflect the increased reference count. If a shared
+block is written, a new block will be allocated, the dirty data written to
+this new block, and the file’s block mapping updated to point to the new
+block. If a shared block is unmapped, the reference count records are updated
+to reflect the decreased reference count and the block is also freed if its
+reference count becomes zero. This enables users to create space-efficient
+clones of disk images and to copy filesystem subtrees quickly, using the
+standard Linux coreutils packages.
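The three tracking rules and the share, write, and unmap steps described above can be modelled in a few lines. This is a toy with hypothetical names, not XFS code. One fixed array stands in for the free space B+trees. Another stands in for the reference count B+tree, and only blocks with two or more owners get a nonzero entry there. The data copy itself is not modelled:

```c
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS  8
#define NO_BLOCK UINT32_MAX

/* free[] ~ free space B+trees; refcount[] ~ refcount B+tree (nonzero
 * only when shared).  A block in neither has exactly one owner. */
struct toy_ag {
	bool free[NBLOCKS];
	uint32_t refcount[NBLOCKS];
};

/* Implied reference count of block b. */
static uint32_t refs(const struct toy_ag *ag, uint32_t b)
{
	if (ag->refcount[b])
		return ag->refcount[b];
	return ag->free[b] ? 0 : 1;
}

/* Reflink: map an already-owned block b into one more file. */
static uint32_t share_block(struct toy_ag *ag, uint32_t b)
{
	if (ag->free[b])
		return NO_BLOCK;	/* nothing to share */
	ag->refcount[b] = refs(ag, b) + 1;
	return b;
}

/* Unmap b from one file: a count of 1 leaves the refcount array and a
 * count of 0 returns the block to free space. */
static void drop_ref(struct toy_ag *ag, uint32_t b)
{
	uint32_t r = refs(ag, b);

	if (r == 0)
		return;			/* already free */
	ag->refcount[b] = (r - 1 >= 2) ? r - 1 : 0;
	if (r == 1)
		ag->free[b] = true;
}

/* Write through one file's mapping of b.  A shared block is remapped to a
 * newly allocated block (copy-on-write).  Returns the block now holding
 * the file's data, or NO_BLOCK if no free block is left. */
static uint32_t write_block(struct toy_ag *ag, uint32_t b)
{
	if (refs(ag, b) < 2)
		return b;		/* sole owner: overwrite in place */
	for (uint32_t nb = 0; nb < NBLOCKS; nb++) {
		if (ag->free[nb]) {
			ag->free[nb] = false;	/* allocate the new block */
			drop_ref(ag, b);	/* old block loses an owner */
			return nb;
		}
	}
	return NO_BLOCK;
}
```

A block owned by one file has an implied count of 1, and reflinking it raises the count to 2. A copy-on-write from either file allocates a new block and drops the old count back to 1, which removes its refcount entry. Unmapping the last owner returns the block to free space.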
+
+Deduplication employs the same mechanism to share blocks and copy them at
+write time. However, the kernel confirms that the contents of both files are
+identical before updating the destination file’s mapping. This enables XFS to
+be used by userspace deduplication programs such as duperemove.
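The compare-before-share step can be sketched as follows. On Linux, userspace asks the kernel to do this through the ``FIDEDUPERANGE`` ioctl. The toy version here works on an in-memory array of blocks, uses hypothetical names, and leaves out the reference count update described in the reflink rules above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLKSZ 16	/* toy block size in bytes */

/*
 * Toy dedupe of one block.  disk[] holds block contents and *dst_map is
 * the destination file's mapping for the block being deduplicated.  The
 * mapping is redirected to src_blk only if both blocks are identical.
 */
static bool dedupe_block(unsigned char disk[][BLKSZ], size_t src_blk,
			 size_t *dst_map)
{
	if (memcmp(disk[src_blk], disk[*dst_map], BLKSZ) != 0)
		return false;	/* contents differ: mapping unchanged */
	*dst_map = src_blk;	/* identical: share the source block */
	return true;
}
```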



[PATCH 03/22] docs: add XFS self-describing metadata integrity doc to DS book

2018-10-03 Thread Darrick J. Wong
From: Darrick J. Wong 

Signed-off-by: Darrick J. Wong 
---
 .../filesystems/xfs-data-structures/overview.rst   |2 
 .../self_describing_metadata.rst   |  402 
 .../filesystems/xfs-self-describing-metadata.txt   |  350 -
 3 files changed, 404 insertions(+), 350 deletions(-)
 create mode 100644 Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
 delete mode 100644 Documentation/filesystems/xfs-self-describing-metadata.txt


diff --git a/Documentation/filesystems/xfs-data-structures/overview.rst b/Documentation/filesystems/xfs-data-structures/overview.rst
index 43b48f30f7e8..8b3de9abcf39 100644
--- a/Documentation/filesystems/xfs-data-structures/overview.rst
+++ b/Documentation/filesystems/xfs-data-structures/overview.rst
@@ -42,3 +42,5 @@ filesystem operations can be carried out atomically in the case of a crash.
 Furthermore, there is the concept of a real-time device wherein allocations
 are tracked more simply and in larger chunks to reduce jitter in allocation
 latency.
+
+.. include:: self_describing_metadata.rst
diff --git a/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst b/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
new file mode 100644
index ..f9d41c76e1d5
--- /dev/null
+++ b/Documentation/filesystems/xfs-data-structures/self_describing_metadata.rst
@@ -0,0 +1,402 @@
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+Metadata Integrity
+------------------
+
+Introduction
+~~~~~~~~~~~~
+
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. The on-disk
+structures and indexes, and the algorithms for iterating them, scale well
+enough to support PB-scale filesystems with billions of inodes; however, it
+is this very scalability that causes the verification problem.
+
+Almost all metadata on XFS is dynamically allocated. The only fixed-location
+metadata are the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+
+For example, it is entirely possible to manually use xfs\_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced
+writes weren’t the ultimate cause of a corruption event. It may take a few
+hours to a few days to perform such forensic analysis, so at this scale
+root cause analysis is entirely possible.
+
+However, if we scale the filesystem up to 1PB, we now have 10x as much
+metadata to analyse and so that analysis blows out towards weeks/months of
+forensic work. Most of the analysis work is slow and tedious, so as the amount
+of analysis goes up, it becomes more likely that the cause will be lost in the noise.
+Hence the primary concern for supporting PB scale filesystems is minimising
+the time and effort required for basic forensic analysis of the filesystem
+structure.
+
+Therefore, the version 5 disk format introduced larger headers for all
+metadata types, which enable the filesystem to check information being read
+from the disk more rigorously. Metadata integrity fields now include:
+
+-  **Magic** numbers, to classify all types of metadata. This is unchanged
+   from v4.
+
+-  A copy of the filesystem **UUID**, to confirm that a given disk block is
+   connected to the superblock.
+
+-  The **owner**, to avoid accessing a piece of metadata which belongs to some
+   other part of the filesystem.
+
+-  The filesystem **block number**, to detect misplaced writes.
+
+-  The **log serial number** of the last write to this block, to avoid
+   replaying obsolete log entries.
+
+-  A CRC32c **checksum** of the entire block, to detect minor corruption.
+
+Metadata integrity coverage has been extended to all metadata blocks in the
+filesystem, with the following notes:
+
+-  Inodes can have multiple "owners" in the directory tree; therefore the
+   record contains the inode number instead of an owner or a block number.
+
+-  Superblocks have no owners.
+
+-  The disk quota file has no owner or block numbers.
+
+-  Metadata owned by files list the inode number as the owner.
+
+-  Per-AG data and B+tree blocks list the AG number as the owner.
+
+-  Per-AG header sectors don’t list owners or block numbers, since they have
+   fixed locations.
+
+-  Remote attribute blocks are not logged and therefore the LSN must be -1.
+
+This functionality enables XFS to decide that a block's contents are so
+unexpected that it should stop immediately. Unfortunately checksums do 

[PATCH v2 00/22] xfs-4.20: major documentation surgery

2018-10-03 Thread Darrick J. Wong
Hi all,

This series converts the existing in-kernel xfs documentation to rst
format, links it in with the rest of the kernel's rst documentation, and
then begins pulling in the contents of the Data Structures & Algorithms
book from the xfs-documentation git tree.  No changes are made to the
text during the import process except to fix things that the conversion
process (asciidoctor + pandoc) didn't do correctly.  The goal of this
series is to tie together the XFS code with the on-disk format
documentation for the features supported by the code.

I've built the docs and put them here, in case you hate reading rst:
https://djwong.org/docs/kdoc/admin-guide/xfs.html
https://djwong.org/docs/kdoc/filesystems/xfs-data-structures/index.html

I've posted a branch here because the png import patch is huge:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=docs-4.20-merge

The patchset should apply cleanly against 4.19-rc6.  Comments and
questions are, as always, welcome.

--D


Re: [RFC PATCH 00/13] ext4: major documentation surgery

2018-07-21 Thread Darrick J. Wong
On Sat, Jul 21, 2018 at 10:19:46AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Jul 10, 2018 at 10:16:42AM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > This series converts the existing in-kernel ext4 documentation to rst
> > format, links it in with the rest of the kernel's rst documentation, and
> > then begins pulling in the contents of the on-disk layout page in the
> > wiki.  No changes are made to the text during the import process except
> > to flatten the nested tables in the old wiki page, which were very
> > difficult to maintain.
> > 
> > I've built the docs and put them here, in case you hate reading rst:
> > https://djwong.org/docs/kdoc/filesystems/ext4/index.html
> 
> I've applied these patches, but I think we still need to do some work
> figuring out how to integrate the ext4 documentation into the
> Linux-doc tree as a whole.
> 
> At the moment they are dropped in the middle of the "Kernel API
> Documentation", by virtue of where the files are located.  Most of the
> ext4.rst file probably should be moved into the user's and
> administrator's guide.

I pondered that -- maybe leave all the ext4 stuff clustered together,
but link to it from the actual user/admin guide section?

> I'm not sure where the best place to put the ondisk documentation.
> It's probably not the user's and administrator's guide, but the Kernel
> API Documentation doesn't seem like the right place.
> 
> But it should also make sense when people are browsing the
> Documentation source directories, too.
> 
> I'm not sure what the best way to do that might be, but maybe some of
> the folks on the linux-doc list will have some suggestions.

I was thinking about having a separate top-level Filesystems section
where we could put user/admin guides, on-disk documentation, etc. and
leave the FS API section alone.

--D

> Cheers,
> 
>   - Ted
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Sphinx version dependencies?

2018-07-19 Thread Darrick J. Wong
On Thu, Jul 19, 2018 at 02:15:56PM -0400, Theodore Y. Ts'o wrote:
> Darrick has sent in patches to convert the ext4 documentation to use
> rst and to be built as part of the full kernel documentation thanks.
> In addition to that, he's imported the on-disk documentation from the
> ext4 wiki into the kernel sources, so hopefully we can keep it more up
> to date.
> 
> When I was experimenting with this, I had to actually build the kernel
> docs using Sphinx for the first time.  I'm using Debian testing, so at
> first I blindly followed the instructions by
> ./scripts/sphinx-pre-install:
> 
> Detected OS: Debian GNU/Linux unstable (sid).
>   /usr/local/bin/virtualenv sphinx_1.4
>   . sphinx_1.4/bin/activate
>   pip install -r Documentation/sphinx/requirements.txt
> 
> But when I did that, Sphinx had heartburn over the ext4.rst file.
> 
> ./include/linux/spi/spi.h:373: ERROR: Unexpected indentation.
> /usr/projects/linux/ext4/Documentation/filesystems/ext4/ext4.rst:139: ERROR: Malformed table.
> Column span alignment problem in table line 5.

Hmmm, apparently it's choking on the table heading borders not matching
the text:

==== ====
Foo  Bar
==== ====   <-- need to extend to EOL
Blah Blah blah blah blah
==== ====

Weirdly, though, while I /can/ get this error to reproduce with the Sphinx
1.4.9 release in the virtualenv, I can't get it to reproduce with the 1.3.6
or 1.6.7 Ubuntu packages.  Maybe it's a python3 thing, maybe not?  Seems
pretty fragile to me.

Anyway, I'll fix ext4.rst and resubmit that part to get this moving.

--D

> ...
> 
> After consulting with Darrick, it appears that Sphinx
> 1.4.9 was the problem.  This is the version that
> Documentation/sphinx/requirements.txt calls for.  He did his rst
> conversion work using Ubuntu 18.04's Sphinx 1.6.7.
> 
> As it turns out Debian testing/unstable already has Sphinx 1.7.5 in
> its repository, so if I simply install Sphinx 1.7.5, it works fine.
> That's what I've done for now.
> 
> So that leaves me with some questions:
> 
> * Is there a reason why scripts/sphinx-pre-install suggested using a
>   Python virtual environment and installing Sphinx 1.4.9 instead of
>   using the distro's pre-packaged Sphinx for Debian unstable/testing?
> 
> * Why does Documentation/sphinx/requirements.txt ask for such an
>   old version of Sphinx?
>  
> * Is it a requirement that *.rst files that are checked into the
>   kernel repo have to work with Sphinx 1.4.9?  Or is it sufficient
>   that it works with Sphinx 1.6.7 and 1.7.5 (which are the prepackaged
>   Debian and Ubuntu versions).  And it looks like Fedora 28 has Sphinx
>   1.7.2 if I'm not mistaken.   How many versions of Sphinx are various
>   automated build/test systems using?
> 
> Thanks,
> 
>   - Ted


Re: [PATCH] xfs: Change URL for the project in xfs.txt

2018-03-21 Thread Darrick J. Wong
On Sat, Mar 03, 2018 at 09:43:10AM +1100, Dave Chinner wrote:
> On Fri, Mar 02, 2018 at 04:08:24PM -0600, Eric Sandeen wrote:
> > 
> > 
> > On 3/2/18 3:57 PM, Dave Chinner wrote:
> > > On Fri, Mar 02, 2018 at 09:24:01AM -0800, Darrick J. Wong wrote:
> > >> On Fri, Mar 02, 2018 at 10:30:13PM +0900, Masanari Iida wrote:
> > >>> The oss.sgi.com doesn't exist any more.
> > >>> Change it to current project URL, https://xfs.wiki.kernel.org/
> > >>>
> > >>> Signed-off-by: Masanari Iida <standby2...@gmail.com>
> > >>> ---
> > >>>  Documentation/filesystems/xfs.txt | 2 +-
> > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> > >>> index 3b9b5c149f32..4d9ff0a7f8e1 100644
> > >>> --- a/Documentation/filesystems/xfs.txt
> > >>> +++ b/Documentation/filesystems/xfs.txt
> > >>> @@ -9,7 +9,7 @@ variable block sizes, is extent based, and makes extensive use of
> > >>>  Btrees (directories, extents, free space) to aid both performance
> > >>>  and scalability.
> > >>>  
> > >>> -Refer to the documentation at http://oss.sgi.com/projects/xfs/
> > >>> +Refer to the documentation at https://xfs.wiki.kernel.org/
> > > 
> > > Did I miss a memo?
> > 
> > About which part, the loss of oss.sgi or the addition of the kernel.org 
> > wiki?
> > 
> > The kernel.org wiki is pretty bare though.  OTOH xfs.org is a bit less
> > official.  We really need to resolve this issue.
> 
> Moving everything to kernel.org wiki. As I mentioned on IRC, I'd
> much prefer we move away from wikis to something we can edit
> locally, review via email, has proper revision control and a
> "publish" mechanism that pushes built documentation out to the
> public website.

Makes sense; it's sort of annoying to have to build the PDFs from the
documentation repo and kup them to k.org manually.  In the meantime I'd
rather have a scribble-me-elmo wiki than a dead URL.

Also afaik the only people who actually have write access to that wiki
are Luis, Eric, and me, so hopefully we won't have to deal with
vandalism in the interim.

I wonder if we could just make the existing dokiwiki use
xfs-documentation.git as its backend and control the publishing that
way...?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Documentation: fix little inconsistencies

2017-08-29 Thread Darrick J. Wong
On Tue, Aug 29, 2017 at 09:27:08AM +0200, Pavel Machek wrote:
> 
> 
> > > index 5e40e1f..6309e90 100644
> > > --- a/Documentation/networking/switchdev.txt
> > > +++ b/Documentation/networking/switchdev.txt
> > > @@ -29,7 +29,7 @@ with SR-IOV or soft switches, such as OVS, are possible.
> > >    sw1p1  +  sw1p3  +  sw1p5  +  eth1
> > >  +|+|+|+
> > >  |||||||
> > > - +--++++-+--++---+  +-+-+
> > > + +--++++++---+  +-+-+
> > 
> > Except for this last part, looks ok.
> 
> Anything wrong here? It is fixing extra '+' in the ascii art...

There's nothing incorrect here; I merely thought it odd to send a fix
for networking documentation to the ext4 list but not to netdev.

--D

>   Pavel
> 
> > Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
> > 
> > --D
> > 
> > >   | Switch driver |  |mgmt   |
> > >   |(this document)|  |   driver  |
> > >   |   |  |   |
> > > 
> > > -- 
> > > (english) http://www.livejournal.com/~pavelmachek
> > > (cesky, pictures) 
> > > http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> > 




Re: [PATCH] Documentation: fix little inconsistencies

2017-08-28 Thread Darrick J. Wong
On Mon, Aug 28, 2017 at 11:46:39AM +0200, Pavel Machek wrote:
> Fix little inconsistencies in Documentation: make case and spacing
> match surrounding text, fix ascii-art.
> 
> Signed-off-by: Pavel Machek <pa...@ucw.cz>
> 
> diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
> index 5a8f7f4..75236c0 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -94,10 +94,10 @@ Note: More extensive information for getting started with ext4 can be
>  * ability to pack bitmaps and inode tables into larger virtual groups via the
>flex_bg feature
>  * large file support
> -* Inode allocation using large virtual block groups via flex_bg
> +* inode allocation using large virtual block groups via flex_bg
>  * delayed allocation
>  * large block (up to pagesize) support
> -* efficient new ordered mode in JBD2 and ext4(avoid using buffer head to force
> +* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
>the ordering)
>  
>  [1] Filesystems with a block size of 1k may see a limit imposed by the
> @@ -105,7 +105,7 @@ directory hash tree having a maximum depth of two.
>  
>  2.2 Candidate features for future inclusion
>  
> -* Online defrag (patches available but not well tested)
> +* online defrag (patches available but not well tested)
>  * reduced mke2fs time via lazy itable initialization in conjunction with
>the uninit_bg feature (capability to do this is available in e2fsprogs
>but a kernel thread to do lazy zeroing of unused inode table blocks
> @@ -602,7 +602,7 @@ Table of Ext4 specific ioctls
> bitmaps and inode table, the userspace tool thus
> just passes the new number of blocks.
>  
> -EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
> + EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
> (like i_blocks, i_size, i_flags, ...) from
> the specified inode with inode
> EXT4_BOOT_LOADER_INO (#5). This is typically
> diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
> index 5e40e1f..6309e90 100644
> --- a/Documentation/networking/switchdev.txt
> +++ b/Documentation/networking/switchdev.txt
> @@ -29,7 +29,7 @@ with SR-IOV or soft switches, such as OVS, are possible.
>    sw1p1  +  sw1p3  +  sw1p5  +  eth1
>  +|+|+|+
>  |||||||
> - +--+----++----+-+--++---+  +-+-+
> + +--++++++---+  +-+-+

Except for this last part, looks ok.

Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

>   | Switch driver |  |mgmt   |
>   |(this document)|  |   driver  |
>   |   |  |   |




Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Darrick J. Wong
On Tue, Jun 06, 2017 at 04:12:58PM -0400, Jeff Layton wrote:
> On Tue, 2017-06-06 at 10:17 -0700, Darrick J. Wong wrote:
> > On Tue, Jun 06, 2017 at 08:23:25PM +0800, Eryu Guan wrote:
> > > On Tue, Jun 06, 2017 at 06:15:57AM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> > > > > On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > > > > > I'm working on a set of kernel patches to change how writeback errors
> > > > > > are handled and reported in the kernel. Instead of reporting a
> > > > > > writeback error to only the first fsync caller on the file, I aim
> > > > > > to make the kernel report them once on every file description.
> > > > > > 
> > > > > > This patch adds a test for the new behavior. Basically, open many fds
> > > > > > to the same file, turn on dm_error, write to each of the fds, and then
> > > > > > fsync them all to ensure that they all get an error back.
> > > > > > 
> > > > > > To do that, I'm adding a new tools/dmerror script that the C program
> > > > > > can use to load the error table. For now, that's all it can do, but
> > > > > > we can fill it out with other commands as necessary.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > > > > 
> > > > > Thanks for the new tests! And sorry for the late review..
> > > > > 
> > > > > It's testing a new behavior on error reporting on writeback, I'm not
> > > > > sure if we can call it a new feature or it fixed a bug? But it's more
> > > > > like a behavior change, I'm not sure how to categorize it.
> > > > > 
> > > > > Because if it's testing a new feature, we usually let test do proper
> > > > > detection of current test environment (based on actual behavior not
> > > > > kernel version) and _notrun on filesystems that don't have this feature
> > > > > yet, instead of failing the test; if it's testing a bug fix, we could
> > > > > let the test fail on unfixed filesystems, which also serves as a
> > > > > reminder that there's a bug to fix.
> > > > > 
> > > > 
> > > > Thanks for the review! I'm not sure how to categorize this either. Since
> > > > the plan is to convert all the filesystems piecemeal, maybe we should
> > > > just consider it a new feature.
> > > 
> > > Then we need a new _require rule to properly detect for the 'feature'
> > > support. I'm not sure if this is doable, but something like
> > > _require_statx, _require_seek_data_hole would be good.
> > > 
> > > > 
> > > > > I pulled your test kernel tree, and test passed on EXT4 but failed on
> > > > > other local filesystems (XFS, btrfs). I assume that's expected.
> > > > > 
> > > > > Besides this kinda high-level question, some minor comments inline.
> > > > > 
> > > > 
> > > > Yes, ext4 should pass on my latest kernel tree, but everything else
> > > > should fail. 
> 
> Oh, and I should mention that ext2/3 also pass when mounted using ext4
> driver. Legacy ext2 driver sort of works, but it reports a few too many
> errors because of the way the ext2_error macro works. That shouldn't be
> too hard to fix, I just need some guidance on that one.
> 
> I had xfs and btrfs working with an earlier iteration of the patches,
> but now that we're converting a fs at a time, it's a little more work to
> get there. It shouldn't be too hard to do though. I'll probably re-post
> in a few days, and will try to take a stab at XFS and btrfs conversion
> too.
> 
> > > 
> > > With the new _require rule, test should _notrun on XFS and btrfs then.
> > 
> > Frankly I personally prefer that upstream XFS fails until someone fixes it. :)
> > (But that's just my opinion.)
> > 
> > That said, I'm not 100% sure what's required of XFS to play nicely with
> > this new mechanism -- glancing at the ext* patches it looks like we'd
> > need to set a fs flag and possibly change some or all of the "write
> > cached dirty buffers out to disk" calls to their _since variants?
> 
> Yeah, that's pretty much the size of it.
> 
> In fact, the latter part (changing to 

Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Darrick J. Wong
On Tue, Jun 06, 2017 at 08:23:25PM +0800, Eryu Guan wrote:
> On Tue, Jun 06, 2017 at 06:15:57AM -0400, Jeff Layton wrote:
> > On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> > > On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > > > I'm working on a set of kernel patches to change how writeback errors
> > > > are handled and reported in the kernel. Instead of reporting a
> > > > writeback error to only the first fsync caller on the file, I aim
> > > > to make the kernel report them once on every file description.
> > > > 
> > > > This patch adds a test for the new behavior. Basically, open many fds
> > > > to the same file, turn on dm_error, write to each of the fds, and then
> > > > fsync them all to ensure that they all get an error back.
> > > > 
> > > > To do that, I'm adding a new tools/dmerror script that the C program
> > > > can use to load the error table. For now, that's all it can do, but
> > > > we can fill it out with other commands as necessary.
> > > > 
> > > > Signed-off-by: Jeff Layton 
> > > 
> > > Thanks for the new tests! And sorry for the late review..
> > > 
> > > It's testing a new behavior on error reporting on writeback, I'm not
> > > sure if we can call it a new feature or it fixed a bug? But it's more
> > > like a behavior change, I'm not sure how to categorize it.
> > > 
> > > Because if it's testing a new feature, we usually let test do proper
> > > detection of current test environment (based on actual behavior not
> > > kernel version) and _notrun on filesystems that don't have this feature
> > > yet, instead of failing the test; if it's testing a bug fix, we could
> > > let the test fail on unfixed filesystems, which also serves as a
> > > reminder that there's a bug to fix.
> > > 
> > 
> > Thanks for the review! I'm not sure how to categorize this either. Since
> > the plan is to convert all the filesystems piecemeal, maybe we should
> > just consider it a new feature.
> 
> Then we need a new _require rule to properly detect for the 'feature'
> support. I'm not sure if this is doable, but something like
> _require_statx, _require_seek_data_hole would be good.
> 
> > 
> > > I pulled your test kernel tree, and test passed on EXT4 but failed on
> > > other local filesystems (XFS, btrfs). I assume that's expected.
> > > 
> > > Besides this kinda high-level question, some minor comments inline.
> > > 
> > 
> > Yes, ext4 should pass on my latest kernel tree, but everything else
> > should fail. 
> 
> With the new _require rule, test should _notrun on XFS and btrfs then.

Frankly I personally prefer that upstream XFS fails until someone fixes it. :)
(But that's just my opinion.)

That said, I'm not 100% sure what's required of XFS to play nicely with
this new mechanism -- glancing at the ext* patches it looks like we'd
need to set a fs flag and possibly change some or all of the "write
cached dirty buffers out to disk" calls to their _since variants?
Metadata writeback errors are handled by retrying writes and/or shutting
down the fs, so I think the f_md_wb_error case is already covered.

Still, I agree that it's useful to detect when the kernel simply
lacks any of the new wb error reporting at all, so that we can skip
the tests.

--D

> 
> > 
> > > > ---
> > > >  common/dmerror |  13 ++--
> > > >  doc/auxiliary-programs.txt |   8 +++
> > > >  src/Makefile   |   2 +-
> > > >  src/fsync-err.c| 161 +
> > > 
> > > New binary needs an entry in .gitignore file.
> > > 
> > 
> > OK, thanks. Will fix.
> > 
> > > >  tests/generic/999  |  76 +
> > > >  tests/generic/999.out  |   3 +
> > > >  tests/generic/group|   1 +
> > > >  tools/dmerror  |  44 +
> > > 
> > > This file is used by the test, then it should be in src/ directory and
> > > be installed along with other executable files on "make install".
> > > Because files under tools/ are not installed. Most people will run tests
> > > in the root dir of xfstests and this is not a problem, but there're
> > > still cases people do "make && make install" and run fstests from
> > > /var/lib/xfstests (default installation target).
> > > 
> > 
> > Ok, no problem. I'll move it. I wasn't sure here since dmerror is a
> > shell script, and most of the stuff in src/ is stuff that needs to be
> > built.
> 
> We do have a few perl or shell scripts in src/ dir, we can see this from
> src/Makefile
> 
> $(LTINSTALL) -m 755 fill2attr fill2fs fill2fs_check scaleread.sh $(PKG_LIB_DIR)/src
> 
> >  
> > > >  8 files changed, 302 insertions(+), 6 deletions(-)
> > > >  create mode 100644 src/fsync-err.c
> > > >  create mode 100755 tests/generic/999
> > > >  create mode 100644 tests/generic/999.out
> > > >  create mode 100755 tools/dmerror
> > > > 
> > > > diff --git a/common/dmerror b/common/dmerror
> > > > index d46c5d0b7266..238baa213b1f 100644
> > > > ---