New upstream release.

author: Dimitri John Ledkov <xnox@ubuntu.com> 2018-01-11 15:44:55 +0000
committer: Dimitri John Ledkov <xnox@ubuntu.com> 2018-01-11 15:44:55 +0000
commit: d78d642bffff6ea49d62c19f26052ed6d3dcc467 (patch)
tree: db0f470018ee6f4b93fb8fd601401fa157e5dbe3 /Documentation/btrfs-man5.asciidoc
parent: b309a4dfbe8130b9fef087df59dd18a487a9c18e (diff)
1 files changed, 129 insertions, 44 deletions
diff --git a/Documentation/btrfs-man5.asciidoc b/Documentation/btrfs-man5.asciidoc
index 3981435e..1f444d73 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -26,6 +26,13 @@ This section describes mount options specific to BTRFS.  For the generic mount
 options please refer to `mount`(8) manpage. The options are sorted alphabetically
 (discarding the 'no' prefix).
 
+NOTE: most mount options apply to the whole filesystem and only options in the
+first mounted subvolume will take effect. This is due to lack of implementation
+and may change in the future. This means that (for example) you can't set
+per-subvolume 'nodatacow', 'nodatasum', or 'compress' using mount options. This
+should eventually be fixed, but it has proved to be difficult to implement
+correctly within the Linux VFS framework.
+
 *acl*::
 *noacl*::
 (default: on)
@@ -51,17 +58,17 @@ location.
 +
 WARNING: Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as
 well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or
-≥ 3.13.4 will break up the ref-links of CoW data (for example files
+≥ 3.13.4 will break up the reflinks of COW data (for example files
 copied with `cp --reflink`, snapshots or de-duplicated data).
 This may cause considerable increase of space usage depending on the
-broken up ref-links.
+broken up reflinks.
 
 *barrier*::
 *nobarrier*::
 (default: on)
 +
 Ensure that all IO write operations make it through the device cache and are stored
-permanently when the filesystem is at it's consistency checkpoint. This
+permanently when the filesystem is at its consistency checkpoint. This
 typically means that a flush command is sent to the device that will
 synchronize all pending data and ordinary metadata blocks, then writes the
 superblock and issues another flush.
@@ -83,21 +90,22 @@ supposed to make it to the permanent storage.
 (since: 3.0, default: off)
 +
 These debugging options control the behavior of the integrity checking
-module (the BTRFS_FS_CHECK_INTEGRITY config option required). +
+module (the BTRFS_FS_CHECK_INTEGRITY config option required). The main goal is
+to verify that all blocks from a given transaction period are properly linked.
 +
-`check_int` enables the integrity checker module, which examines all
+'check_int' enables the integrity checker module, which examines all
 block write requests to ensure on-disk consistency, at a large
-memory and CPU cost. +
+memory and CPU cost.
 +
-`check_int_data` includes extent data in the integrity checks, and
-implies the check_int option. +
+'check_int_data' includes extent data in the integrity checks, and
+implies the 'check_int' option.
 +
-`check_int_print_mask` takes a bitmask of BTRFSIC_PRINT_MASK_* values
+'check_int_print_mask' takes a bitmask of BTRFSIC_PRINT_MASK_* values
 as defined in 'fs/btrfs/check-integrity.c', to control the integrity
-checker module behavior. +
+checker module behavior.
 +
 See comments at the top of 'fs/btrfs/check-integrity.c'
-for more info.
+for more information.
 
 *clear_cache*::
 Force clearing and rebuilding of the disk space cache if something
@@ -106,10 +114,11 @@ has gone wrong. See also: 'space_cache'.
 *commit='seconds'*::
 (since: 3.12, default: 30)
 +
-Set the interval of periodic commit. Higher
-values defer data being synced to permanent storage with obvious
-consequences when the system crashes. The upper bound is not forced,
-but a warning is printed if it's more than 300 seconds (5 minutes).
+Set the interval of periodic transaction commit when data are synchronized
+to permanent storage. Higher interval values lead to a larger amount of unwritten
+data, which has obvious consequences when the system crashes.
+The upper bound is not forced, but a warning is printed if it's more than 300
+seconds (5 minutes). Use with care.
 
 *compress*::
 *compress='type'*::
@@ -120,7 +129,7 @@ but a warning is printed if it's more than 300 seconds (5 minutes).
 Control BTRFS file data compression.  Type may be specified as 'zlib',
 'lzo', 'zstd' or 'no' (for no compression, used for remounting).  If no type
 is specified, 'zlib' is used.  If 'compress-force' is specified,
-the compression will allways be attempted, but the data may end up uncompressed
+then compression will always be attempted, but the data may end up uncompressed
 if the compression would make them larger.
 +
 Otherwise some simple heuristics are applied to detect an incompressible file.
@@ -141,6 +150,10 @@ Enable data copy-on-write for newly created files.
 under 'nodatacow' are also set the NOCOW file attribute (see `chattr`(1)).
 +
 NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
++
+Updates in-place improve performance for workloads that do frequent overwrites,
+at the cost of potential partial writes, in case the write is interrupted
+(system crash, device failure).
 
 *datasum*::
 *nodatasum*::
@@ -152,13 +165,31 @@ under 'nodatasum' inherit the "no checksums" property, however there's no
 corresponding file attribute (see `chattr`(1)).
 +
 NOTE: If 'nodatacow' or 'nodatasum' are enabled, compression is disabled.
++
+There is a slight performance gain when checksums are turned off, because the
+corresponding metadata blocks holding the checksums do not need to be updated.
+The cost of checksumming the blocks in memory is much lower than the IO, and
+modern CPUs feature hardware support for the checksumming algorithm.
 
 *degraded*::
 (default: off)
 +
-Allow mounts with less devices than the raid profile constraints
-require.  A read-write mount (or remount) may fail with too many devices
+Allow mounts with fewer devices than the RAID profile constraints
+require.  A read-write mount (or remount) may fail when there are too many devices
 missing, for example if a stripe member is completely missing from RAID0.
++
+Since 4.14, the constraint checks have been improved and are verified on the
+chunk level, not at the device level. This allows degraded mounts of
+filesystems with mixed RAID profiles for data and metadata, even if the
+device number constraints would not be satisfied for some of the profiles.
++
+Example: metadata -- raid1, data -- single, devices -- /dev/sda, /dev/sdb
++
+Suppose the data are completely stored on 'sda', then missing 'sdb' will not
+prevent the mount, even if 1 missing device would normally prevent (any)
+'single' profile from mounting. In case some of the data chunks are stored on 'sdb',
+then the constraint of single/data is not satisfied and the filesystem
+cannot be mounted.
 
 *device='devicepath'*::
 Specify a path to a device that will be scanned for BTRFS filesystem during
@@ -174,14 +205,22 @@ system at that point.
 *nodiscard*::
 (default: off)
 +
-Enable discarding of freed file blocks using TRIM operation.  This is useful
-for SSD devices, thinly provisioned LUNs or virtual machine images where the
-backing device understands the operation. Depending on support of the
-underlying device, the operation may severely hurt performance in case the TRIM
-operation is synchronous (eg. with SATA devices up to revision 3.0).
-+
+Enable discarding of freed file blocks.  This is useful for SSD devices, thinly
+provisioned LUNs, or virtual machine images; however, every storage layer must
+support discard for it to work. If the backing device does not support
+asynchronous queued TRIM, then this operation can severely degrade performance,
+because a synchronous TRIM operation will be attempted instead. Queued TRIM
+requires chipsets and devices newer than SATA revision 3.1.
+
+If it is not necessary to immediately discard freed blocks, then the `fstrim`
+tool can be used to discard all free blocks in a batch. Scheduling a TRIM
+during a period of low system activity will prevent latent interference with
+the performance of other operations. Also, a device may ignore the TRIM command
+if the range is too small, so running a batch discard has a greater probability
+of actually discarding the blocks.
+
 If discarding is not necessary to be done at the block freeing time, there's
-`fstrim` tool that lets the filesystem discard all free blocks in a batch,
+`fstrim`(8) tool that lets the filesystem discard all free blocks in a batch,
 possibly not much interfering with other operations. Also, the the device may
 ignore the TRIM command if the range is too small, so running the batch discard
 can actually discard the blocks.
@@ -215,7 +254,7 @@ This option forces any data dirtied by a write in a prior transaction to commit
 as part of the current commit, effectively a full filesystem sync.
 +
 This makes the committed state a fully consistent view of the file system from
-the application's perspective (i.e., it includes all completed file system
+the application's perspective (i.e. it includes all completed file system
 operations). This was previously the behavior only when a snapshot was
 created.
 +
@@ -245,6 +284,14 @@ the option.
 +
 NOTE: Defaults to off due to a potential overflow problem when the free space
 checksums don't fit inside a single page.
++
+Don't use this option unless you really need it. The inode number limit
+on 64bit systems is 2^64^, which is practically enough for the whole filesystem
+lifetime. Due to the implementation of the Linux VFS layer, the inode numbers on 32bit
+systems are only 32 bits wide. This lowers the limit significantly and makes
+it possible to reach it. In such a case, this mount option will help.
+Alternatively, files with high inode numbers can be copied to a new subvolume
+which will effectively start the inode numbers from the beginning again.
 
 *logreplay*::
 *nologreplay*::
@@ -258,7 +305,7 @@ disable that behaviour, mount also with 'nologreplay'.
 *max_inline='bytes'*::
 (default: min(2048, page size) )
 +
-Specify the maximum amount of space, in bytes, that can be inlined in
+Specify the maximum amount of space that can be inlined in
 a metadata B-tree leaf.  The value is specified in bytes, optionally
 with a K suffix (case insensitive).  In practice, this value
 is limited by the filesystem block size (named 'sectorsize' at mkfs time),
@@ -319,8 +366,8 @@ the space cache consumes some resources, including a small amount of disk
 space.
 +
 There are two implementations of the free space cache. The original
-implementation, 'v1', is the safe default. The 'v1' space cache can be disabled
-at mount time with 'nospace_cache' without clearing.
+one, referred to as 'v1', is the safe default. The 'v1' space cache can be
+disabled at mount time with 'nospace_cache' without clearing.
 +
 On very large filesystems (many terabytes) and certain workloads, the
 performance of the 'v1' space cache may degrade drastically. The 'v2'
@@ -329,12 +376,12 @@ this issue. Once enabled, the 'v2' space cache will always be used and cannot
 be disabled unless it is cleared. Use 'clear_cache,space_cache=v1' or
 'clear_cache,nospace_cache' to do so. If 'v2' is enabled, kernels without 'v2'
 support will only be able to mount the filesystem in read-only mode. The
-`btrfs(8)` command currently only has read-only support for 'v2'. A read-write
+`btrfs`(8) command currently only has read-only support for 'v2'. A read-write
 command may be run on a 'v2' filesystem by clearing the cache, running the
 command, and then remounting with 'space_cache=v2'.
 +
 If a version is not explicitly specified, the default implementation will be
-chosen, which is 'v1' as of 4.9.
+chosen, which is 'v1'.
 
 *ssd*::
 *ssd_spread*::
@@ -342,10 +389,22 @@ chosen, which is 'v1' as of 4.9.
 (default: SSD autodetected)
 +
 Options to control SSD allocation schemes.  By default, BTRFS will
-enable or disable SSD allocation heuristics depending on whether a
-rotational or non-rotational device is in use (contents of
-'/sys/block/DEV/queue/rotational'). If it is, the 'ssd' option is turned on.
-The option 'nossd' will disable the autodetection.
+enable or disable SSD optimizations depending on whether the device is
+rotational or non-rotational. This is determined by the contents of
+'/sys/block/DEV/queue/rotational'. If it is 0, the 'ssd' option is
+turned on.  The option 'nossd' will disable the autodetection.
++
+The optimizations make use of the absence of the seek penalty that's inherent
+in rotational devices. The blocks can typically be written faster and
+are not offloaded to separate threads.
++
+NOTE: Since 4.14, the block layout optimizations have been dropped. This used
+to help with first generations of SSD devices. Their FTL (flash translation
+layer) was not effective and the optimization was supposed to improve the wear
+by better aligning blocks. This is no longer true with modern SSD devices and
+the optimization had no real benefit. Furthermore it caused increased
+fragmentation. The layout tuning has been kept intact for the option
+'ssd_spread'.
 +
 The 'ssd_spread' mount option attempts to allocate into bigger and aligned
 chunks of unused space, and may perform better on low-end SSDs.  'ssd_spread'
@@ -354,25 +413,26 @@ will disable all SSD options.
 
 *subvol='path'*::
 Mount subvolume from 'path' rather than the toplevel subvolume. The
-'path' is absolute (ie. starts at the toplevel subvolume).
+'path' is always treated as relative to the toplevel subvolume.
 This mount option overrides the default subvolume set for the given filesystem.
 
 *subvolid='subvolid'*::
 Mount subvolume specified by a 'subvolid' number rather than the toplevel
-subvolume.  You can use *btrfs subvolume list* to see subvolume ID numbers.
+subvolume.  You can use *btrfs subvolume list* or *btrfs subvolume show* to see
+subvolume ID numbers.
 This mount option overrides the default subvolume set for the given filesystem.
 +
 NOTE: if both 'subvolid' and 'subvol' are specified, they must point at the
-same subvolume, otherwise mount will fail.
+same subvolume, otherwise the mount will fail.
 
 *thread_pool='number'*::
 (default: min(NRCPUS + 2, 8) )
 +
-The number of worker threads to allocate. NRCPUS is number of on-line CPUs
+The number of worker threads to start. NRCPUS is number of on-line CPUs
 detected at the time of mount. Small number leads to less parallelism in
 processing data and metadata, higher numbers could lead to a performance hit
-due to increased locking contention, cache-line bouncing or costly data
-transfers between local CPU memories.
+due to increased locking contention, process scheduling, cache-line bouncing or
+costly data transfers between local CPU memories.
 
 *treelog*::
 *notreelog*::
@@ -384,13 +444,14 @@ are flushed at sync and transaction commit. If the system crashes between two
 such syncs, the pending tree log operations are replayed during mount.
 +
 WARNING: currently, the tree log is replayed even with a read-only mount! To
-disable that behaviour, mount also with 'nologreplay'.
+disable that behaviour, also mount with 'nologreplay'.
 +
 The tree log could contain new files/directories, these would not exist on
 a mounted filesystem if the log is not replayed.
 
 *usebackuproot*::
 *nousebackuproot*::
+(since: 4.6, default: off)
 +
 Enable autorecovery attempts if a bad tree root is found at mount time.
 Currently this scans a backup list of several previous tree roots and tries to
@@ -403,6 +464,11 @@ NOTE: This option has replaced 'recovery'.
 +
 Allow subvolumes to be deleted by their respective owner. Otherwise, only the
 root user can do that.
++
+NOTE: historically, any user could create a snapshot even if they were not the
+owner of the source subvolume; subvolume deletion has been restricted for that
+reason. Subvolume creation has been restricted as well, but this mount option is
+still required. This is a usability issue and will be addressed in the future.
 
 DEPRECATED MOUNT OPTIONS
 ~~~~~~~~~~~~~~~~~~~~~~~~
@@ -428,6 +494,25 @@ but will work on 4.5+ kernels.
 A workaround option from times (pre 3.2) when it was not possible to mount a
 subvolume that did not reside directly under the toplevel subvolume.
 
+NOTES ON GENERIC MOUNT OPTIONS
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some of the general mount options from `mount`(8) affect BTRFS and are
+worth mentioning.
+
+*noatime*::
+Under read-intensive workloads, specifying 'noatime' significantly improves
+performance because no new access time information needs to be written. Without
+this option, the default is 'relatime', which only reduces the number of
+inode atime updates in comparison to the traditional 'strictatime'. The worst
+case for atime updates under 'relatime' occurs when many files are read whose
+atime is older than 24 h and which are freshly snapshotted. In that case the
+atime is updated 'and' COW happens - for each file - in bulk. See also
+https://lwn.net/Articles/499293/ - 'Atime and btrfs: a bad combination? (LWN, 2012-05-31)'.
++
+Note that 'noatime' may break applications that rely on atime updates like
+the venerable Mutt (unless you use maildir mailboxes).
+
 
 FILESYSTEM FEATURES
 -------------------
@@ -566,8 +651,8 @@ long as this attribute is set (obviously the exception is unsetting the attribut
 'O_DSYNC'
 
 *X*::
-'no compression', permanently turn off compression on the given file, other
-compression mount options will not affect that
+'no compression', permanently turn off compression on the given file. Any
+compression mount options will not affect this file.
 +
 When set on a directory, all newly created files will inherit this attribute.
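
The defaults quoted in the patched text for 'thread_pool' (min(NRCPUS + 2, 8))
and 'max_inline' (min(2048, page size)) can be illustrated with a minimal
sketch. The function names below are hypothetical and are not part of btrfs or
btrfs-progs; `os.cpu_count()` is only an approximation of the number of on-line
CPUs the kernel detects at mount time, and `mmap.PAGESIZE` is assumed to match
the kernel page size.

```python
import mmap
import os


def default_thread_pool(nrcpus: int) -> int:
    """Documented default for thread_pool: min(NRCPUS + 2, 8)."""
    if nrcpus < 1:
        raise ValueError("nrcpus must be at least 1")
    return min(nrcpus + 2, 8)


def default_max_inline(page_size: int) -> int:
    """Documented default for max_inline: min(2048, page size)."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    return min(2048, page_size)


if __name__ == "__main__":
    # Assumption: os.cpu_count() stands in for NRCPUS; it may return None.
    nrcpus = os.cpu_count() or 1
    print("thread_pool default:", default_thread_pool(nrcpus))
    print("max_inline default:", default_max_inline(mmap.PAGESIZE))
```

The sketch only evaluates the formulas as written in the man page text; the
values a running kernel actually uses may be further limited, for example by
the filesystem block size in the case of 'max_inline'.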