path: root/mdmon.8
diff options
Diffstat (limited to 'mdmon.8')
1 files changed, 257 insertions, 0 deletions
diff --git a/mdmon.8 b/mdmon.8
new file mode 100644
index 00000000..4f9a439a
--- /dev/null
+++ b/mdmon.8
@@ -0,0 +1,257 @@
+.\" See file COPYING in distribution for details.
+.TH MDMON 8 "" v3.3.2
+mdmon \- monitor MD external metadata arrays
+.BI mdmon " [--all] [--takeover] [--foreground] CONTAINER"
+The 2.6.27 kernel brings the ability to support external metadata arrays.
+External metadata implies that user space handles all updates to the metadata.
+The kernel's responsibility is to notify user space when a "metadata event"
+occurs, like disk failures and clean-to-dirty transitions. The kernel, in
+important cases, waits for user space to take action on these notifications.
+.SS Metadata updates:
+To service metadata update requests a daemon,
+.IR mdmon ,
+is introduced.
+.I Mdmon
+is tasked with polling the sysfs namespace looking for changes in
+.BR array_state ,
+.BR sync_action ,
+and per disk
+.BR state
+attributes. When a change is detected it calls a per metadata type
+handler to make modifications to the metadata. The following actions
+are taken:
+.B array_state \- inactive
+Clear the dirty bit for the volume and let the array be stopped
+.B array_state \- write pending
+Set the dirty bit for the array and then set
+.B array_state
+.BR active .
+are blocked until userspace writes
+.BR active.
+.B array_state \- active-idle
+The safe mode timer has expired so set array state to clean to block writes to the array
+.B array_state \- clean
+Clear the dirty bit for the volume
+.B array_state \- read-only
+This is the initial state that all arrays start at.
+.I mdmon
+takes one of the three actions:
+Transition the array to read-auto keeping the dirty bit clear if the metadata
+handler determines that the array does not need resyncing or other modification
+Transition the array to active if the metadata handler determines a resync or
+some other manipulation is necessary
+Leave the array read\-only if the volume is marked to not be monitored; for
+example, the metadata version has been set to "external:\-dev/md127" instead of
+.B sync_action \- resync\-to\-idle
+Notify the metadata handler that a resync may have completed. If a resync
+process is idled before it completes this event allows the metadata handler to
+checkpoint resync.
+.B sync_action \- recover\-to\-idle
+A spare may have completed rebuilding so tell the metadata handler about the
+state of each disk. This is the metadata handler's opportunity to clear
+any "out-of-sync" bits and clear the volume's degraded status. If a recovery
+process is idled before it completes this event allows the metadata handler to
+checkpoint recovery.
+.B <disk>/state \- faulty
+A disk failure kicks off a series of events. First, notify the metadata
+handler that a disk has failed, and then notify the kernel that it can unblock
+writes that were dependent on this disk. After unblocking the kernel this disk
+is set to be removed+ from the member array. Finally the disk is marked failed
+in all other member arrays in the container.
++ Note This behavior differs slightly from native MD arrays where
+removal is reserved for a
+.B mdadm --remove
+event. In the external metadata case the container holds the final
+reference on a block device and a
+.B mdadm --remove <container> <victim>
+call is still required.
+.SS Containers:
+External metadata formats, like DDF, differ from the native MD metadata
+formats in that they define a set of disks and a series of sub-arrays
+within those disks. MD metadata in comparison defines a 1:1
+relationship between a set of block devices and a RAID array. For
+example to create 2 arrays at different RAID levels on a single
+set of disks, MD metadata requires the disks be partitioned and then
+each array can be created with a subset of those partitions. The
+supported external formats perform this disk carving internally.
+Container devices simply hold references to all member disks and allow
+tools like
+.I mdmon
+to determine which active arrays belong to which
+container. Some array management commands like disk removal and disk
+add are now only valid at the container level. Attempts to perform
+these actions on member arrays are blocked with error messages like:
+"mdadm: Cannot remove disks from a \'member\' array, perform this
+operation on the parent container"
+Containers are identified in /proc/mdstat with a metadata version string
+"external:<metadata name>". Member devices are identified by
+"external:/<container device>/<member index>", or "external:-<container
+device>/<member index>" if the array is to remain readonly.
+.B container
+device to monitor. It can be a full path like /dev/md/container, or a
+simple md device name like md127.
+.B \-\-foreground
+.I mdmon
+will fork and continue in the background. Adding this option will
+skip that step and run
+.I mdmon
+in the foreground.
+.B \-\-takeover
+This instructs
+.I mdmon
+to replace any active
+.I mdmon
+which is currently monitoring the array. This is primarily used late
+in the boot process to replace any
+.I mdmon
+which was started from an
+.B initramfs
+before the root filesystem was mounted. This avoids holding a
+reference on that
+.B initramfs
+indefinitely and ensures that the
+.I pid
+.I sock
+files used to communicate with
+.I mdmon
+are in a standard place.
+.B \-\-all
+This tells mdmon to find any active containers and start monitoring
+each of them if appropriate. This is normally used with
+.B \-\-takeover
+late in the boot sequence.
+A separate
+.I mdmon
+process is started for each container as the
+.B \-\-all
+argument is over-written with the name of the container. To allow for
+containers with names longer than 5 characters, this argument can be
+arbitrarily extended, e.g. to
+.BR \-\-all-active-arrays .
+Note that
+.I mdmon
+is automatically started by
+.I mdadm
+when needed and so does not need to be considered when working with
+RAID arrays. The only times it is run other than by
+.I mdadm
+is when the boot scripts need to restart it after mounting the new
+root filesystem.
+.I mdmon
+needs to be running whenever any filesystem on the monitored device is
+mounted there are special considerations when the root filesystem is
+mounted from an
+.I mdmon
+monitored device.
+Note that in general
+.I mdmon
+is needed even if the filesystem is mounted read-only as some
+filesystems can still write to the device in those circumstances, for
+example to replay a journal after an unclean shutdown.
+When the array is assembled by the
+.B initramfs
+code, mdadm will automatically start
+.I mdmon
+as required. This means that
+.I mdmon
+must be installed on the
+.B initramfs
+and there must be a writable filesystem (typically tmpfs) in which
+.B mdmon
+can create a
+.B .pid
+.B .sock
+file. The particular filesystem to use is given to mdmon at compile
+time and defaults to
+.BR /run/mdadm .
+This filesystem must persist through to shutdown time.
+After the final root filesystem has be instantiated (usually with
+.BR pivot_root )
+.I mdmon
+should be run with
+.I "\-\-all \-\-takeover"
+so that the
+.I mdmon
+running from the
+.B initramfs
+can be replaced with one running in the main root, and so the
+memory used by the initramfs can be released.
+At shutdown time,
+.I mdmon
+should not be killed along with other processes. Also as it holds a
+file (socket actually) open in
+.B /dev
+(by default) it will not be possible to unmount
+.B /dev
+if it is a separate filesystem.
+.B " mdmon \-\-all-active-arrays \-\-takeover"
+.I mdmon
+which is currently running is killed and a new instance is started.
+This should be run during in the boot sequence if an initramfs was
+used, so that any mdmon running from the initramfs will not hold
+the initramfs active.
+.IR mdadm (8),
+.IR md (4).