external-reshape-design.txt



External Reshape

1 Problem statement

External (third-party metadata) reshape differs from native-metadata
reshape in three key ways:

1.1 Format specific constraints

In the native case reshape is limited by what is implemented in the
generic reshape routine (Grow_reshape()) and what is supported by the
kernel.  There are exceptional cases where Grow_reshape() may block
operations when it knows that the kernel implementation is broken, but
otherwise the kernel is relied upon to be the final arbiter of what
reshape operations are supported.

In the external case the kernel, and the generic checks in
Grow_reshape(), become the super-set of what reshapes are possible.  The
metadata format may not support, or may not yet implement, a given
reshape type.  The implication for Grow_reshape() is that it must query
the metadata handler and effect changes in the metadata before the new
geometry is posted to the kernel.  The ->reshape_super method allows
Grow_reshape() to validate the requested operation and post the metadata
update.
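
The ordering described above can be sketched in C as follows.  This is a
minimal illustration only: struct geometry, struct handler,
toy_reshape_super() and post_to_kernel() are hypothetical stand-ins, not
mdadm's real types or functions, and the 6-disk limit is an arbitrary toy
constraint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not mdadm's real types. */
struct geometry {
    int level;
    int raid_disks;
};

struct handler {
    /* Returns 0 and records the update if the format accepts the geometry. */
    int (*reshape_super)(struct handler *h, const struct geometry *want);
    struct geometry recorded;   /* what the metadata currently describes */
    int updated;                /* set once a metadata update was posted */
};

/* Toy format constraint (an assumption): at most 6 disks are representable. */
static int toy_reshape_super(struct handler *h, const struct geometry *want)
{
    if (want->raid_disks > 6)
        return -1;
    h->recorded = *want;
    h->updated = 1;
    return 0;
}

/* Stand-in for writing the new geometry to the kernel. */
static int post_to_kernel(struct geometry *kernel, const struct geometry *want)
{
    *kernel = *want;
    return 0;
}

/*
 * Metadata first, kernel second: the kernel never sees a geometry that
 * the metadata handler refused.
 */
static int grow_reshape(struct handler *h, struct geometry *kernel,
                        const struct geometry *want)
{
    assert(h != NULL && h->reshape_super != NULL);
    if (h->reshape_super(h, want) != 0)
        return -1;
    return post_to_kernel(kernel, want);
}
```

With this shape a request for 7 disks fails in the handler and leaves
both the recorded metadata and the kernel geometry untouched.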

1.2 Scope of reshape

Native metadata reshape is always performed at the array scope (no
metadata relationship with sibling arrays on the same disks).  External
reshape, depending on the format, may not allow the number of member
disks to be changed in a subarray unless the change is simultaneously
applied to all subarrays in the container.  For example, the imsm format
requires every member disk to be a member of every subarray, so a 4-disk
raid5 in a container that also houses a 4-disk raid10 array could not be
reshaped to 5 disks as the imsm format does not support a 5-disk raid10
representation.  This requires the ->reshape_super method to check the
contents of the array and ask the user to run the reshape at container
scope (if both subarrays are agreeable to the change), or report an
error in the case where one subarray cannot support the change.
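
A possible shape for that scope check is sketched below.  The names
(enum scope_result, struct subarray, format_supports(),
check_disk_change()) are invented for illustration, and the rule inside
format_supports() is a toy stand-in for whatever limits a real format
imposes.

```c
#include <assert.h>
#include <stddef.h>

enum scope_result {
    SCOPE_OK,            /* change may proceed at the requested scope */
    SCOPE_USE_CONTAINER, /* all subarrays agree; rerun at container scope */
    SCOPE_UNSUPPORTED    /* at least one subarray cannot take the change */
};

/* Hypothetical, simplified view of one subarray in a container. */
struct subarray {
    int level;
    int raid_disks;
};

/*
 * Toy rule standing in for the format's real limits (an assumption):
 * raid10 only with exactly 4 disks, other levels with any positive count.
 */
static int format_supports(int level, int raid_disks)
{
    if (level == 10)
        return raid_disks == 4;
    return raid_disks > 0;
}

/*
 * Every subarray shares the same member disks, so a new disk count must
 * be acceptable to all of them before it is acceptable to any of them.
 */
static enum scope_result check_disk_change(const struct subarray *subs,
                                           size_t n, int new_disks,
                                           int container_scope)
{
    size_t i;

    assert(subs != NULL && n > 0);
    for (i = 0; i < n; i++)
        if (!format_supports(subs[i].level, new_disks))
            return SCOPE_UNSUPPORTED;
    if (n > 1 && !container_scope)
        return SCOPE_USE_CONTAINER;
    return SCOPE_OK;
}
```

For the raid5 + raid10 container from the example, a request for 5 disks
yields SCOPE_UNSUPPORTED at either scope, while a container holding two
raid5 subarrays yields SCOPE_USE_CONTAINER when asked at subarray scope.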

1.3 Monitoring / checkpointing

Reshape, unlike rebuild/resync, requires strict checkpointing to survive
interrupted reshape operations.  For example when expanding a raid5
array the first few stripes of the array will be overwritten in a
destructive manner.  When restarting the reshape process we need to know
the exact location of the last successfully written stripe, and we need
to restore the data in any partially overwritten stripe.  Native
metadata stores this backup data in the unused portion of spares that
are being promoted to array members, or in an external backup file
(located on a non-involved block device).

The kernel is in charge of recording checkpoints of reshape progress,
but mdadm is delegated the task of managing the backup space which
involves:
1/ Identifying what data will be overwritten in the next unit of reshape
   operation
2/ Suspending access to that region so that a snapshot of the data can
   be transferred to the backup space.
3/ Allowing the kernel to reshape the saved region and setting the
   boundary for the next backup.
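
The three tasks above, plus the restore on restart, can be modelled with
a small in-memory simulation.  Everything here is hypothetical: the
8-block array, the 2-block unit, struct reshape_state and the helper
names exist only to show the ordering, and 'X' stands for data rewritten
in the new layout.

```c
#include <assert.h>
#include <string.h>

#define ARRAY_BLOCKS 8
#define UNIT 2

/* Hypothetical model of an array under reshape; not mdadm's real state. */
struct reshape_state {
    char data[ARRAY_BLOCKS];    /* live array contents */
    char backup[UNIT];          /* backup space */
    int backup_start;           /* first block held in backup, -1 if none */
    int suspend_lo, suspend_hi; /* I/O is suspended for [lo, hi) */
    int checkpoint;             /* blocks fully reshaped so far */
};

static void reshape_init(struct reshape_state *s, const char *contents)
{
    memcpy(s->data, contents, ARRAY_BLOCKS);
    s->backup_start = -1;
    s->suspend_lo = s->suspend_hi = 0;
    s->checkpoint = 0;
}

/* Tasks 1 and 2: find the next unit, suspend access to it, snapshot it. */
static void backup_next_unit(struct reshape_state *s)
{
    int start = s->checkpoint;

    assert(start + UNIT <= ARRAY_BLOCKS);
    s->suspend_lo = start;
    s->suspend_hi = start + UNIT;
    memcpy(s->backup, &s->data[start], UNIT);
    s->backup_start = start;
}

/*
 * Task 3: the "kernel" destructively rewrites the saved unit.  Passing
 * blocks_written < UNIT models an interruption: the checkpoint does not
 * advance and the unit is left partially overwritten.
 */
static void kernel_reshape_unit(struct reshape_state *s, int blocks_written)
{
    int i;

    for (i = 0; i < blocks_written && i < UNIT; i++)
        s->data[s->checkpoint + i] = 'X';
    if (blocks_written >= UNIT) {
        s->checkpoint += UNIT;
        s->suspend_lo = s->suspend_hi = s->checkpoint;
    }
}

/* On restart, roll a partially overwritten unit back from the backup. */
static void restart_restore(struct reshape_state *s)
{
    if (s->backup_start == s->checkpoint)
        memcpy(&s->data[s->backup_start], s->backup, UNIT);
}
```

The point of the ordering is that the backup always covers the unit just
past the last checkpoint, so a restart can restore exactly that unit and
resume from the checkpoint.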

In the external reshape case we want to preserve this mdadm
'reshape-manager' arrangement, but have a third actor, mdmon, to
consider.  It is tempting to give the role of managing reshape to mdmon,
but that is counter to its role as a monitor, and conflicts with the
existing capabilities and role of mdadm to manage the progress of
reshape.  For clarity the external reshape implementation maintains the
role of mdmon as a (mostly) passive recorder of raid events, and mdadm
treats it as it would the kernel in the native reshape case (modulo
needing to send explicit metadata update messages and checking that
mdmon took the expected action).

External reshape can use the generic md backup file as a fallback, but in the
optimal/firmware-compatible case the reshape-manager will use the metadata
specific areas for managing reshape.  The implementation also needs to spawn a
reshape-manager per subarray when the reshape is being carried out at the
container level.  For these two reasons the ->manage_reshape() method is
introduced.  This method, in addition to the base tasks mentioned above:
1/ Spawns a manager per-subarray, when necessary
2/ Uses either generic routines in Grow.c for md-style backup file
   support, or uses the metadata-format specific location for storing
   recovery data.
This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
optionally take advantage of generic infrastructure in Grow.c.
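
One way to picture the two decisions made by ->manage_reshape() is the
planning helper below.  It is a sketch under assumptions: struct
manager_plan and plan_managers() are hypothetical names, and a real
implementation would spawn a process per plan entry rather than fill in
an array.

```c
#include <assert.h>
#include <stddef.h>

enum backup_kind { BACKUP_GENERIC_FILE, BACKUP_METADATA_AREA };

/* Hypothetical description of one reshape-manager to be spawned. */
struct manager_plan {
    int subarray;             /* index of the subarray it drives */
    enum backup_kind backup;  /* where it saves recovery data */
};

/*
 * Plans one manager per subarray at container scope, or a single manager
 * for `target` at subarray scope.  Prefers the metadata-specific area and
 * falls back to a generic backup file.  Returns the number of managers
 * planned, or -1 when no backup location exists or `out` is too small.
 */
static int plan_managers(int has_metadata_area, const char *backup_file,
                         int container_scope, int n_subarrays, int target,
                         struct manager_plan *out, int out_len)
{
    enum backup_kind kind;
    int first, count, i;

    assert(out != NULL && n_subarrays > 0);
    assert(target >= 0 && target < n_subarrays);

    if (has_metadata_area)
        kind = BACKUP_METADATA_AREA;
    else if (backup_file != NULL)
        kind = BACKUP_GENERIC_FILE;
    else
        return -1;

    first = container_scope ? 0 : target;
    count = container_scope ? n_subarrays : 1;
    if (count > out_len)
        return -1;

    for (i = 0; i < count; i++) {
        out[i].subarray = first + i;
        out[i].backup = kind;
    }
    return count;
}
```

Keeping the choice inside the handler-facing helper, rather than in a
generic layer that calls down into the format, mirrors the intent of
letting the handler opt in to the generic backup-file routines.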

2 Details for specific reshape requests

There are quite a few moving pieces spread out across md, mdadm, and mdmon for
the support of external reshape, and there are several different types of
reshape that the implementation must handle.  A rundown of
these details follows.

2.0 General provisions

Obtain an exclusive open on the container to make sure we are not
running concurrently with a Create() event.

2.1 Freezing sync_action

2.2 Reshape size

   1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
      initializes st->update_tail
   2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size
      change is allowed (being performed at subarray scope / enough room) and
      prepares a metadata update
   3/ mdadm::Grow_reshape(): flushes the metadata update (via
      flush_metadata_update(), or ->sync_metadata())
   4/ mdadm::Grow_reshape(): posts the new size to the kernel
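
The size-change sequence, including the two ways the metadata update can
be flushed, might be modelled as below.  struct st_sim and its fields
are hypothetical; use_update_tail merely stands in for st->update_tail
having been initialized, and the counters replace the real
flush_metadata_update() and ->sync_metadata() calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the state touched by a size reshape. */
struct st_sim {
    int mdmon_running;
    int use_update_tail;              /* models st->update_tail being set */
    int have_pending;
    unsigned long long pending_size;  /* update prepared by reshape_super */
    unsigned long long max_size;      /* room available for growth */
    unsigned long long metadata_size; /* size recorded in the metadata */
    unsigned long long kernel_size;   /* size posted to the kernel */
    int flushed_via_mdmon;            /* updates handed to mdmon */
    int synced_directly;              /* updates written by mdadm itself */
};

/* Step 2: validate the request and prepare (not yet write) the update. */
static int reshape_super_size(struct st_sim *st, unsigned long long new_size)
{
    if (new_size == 0 || new_size > st->max_size)
        return -1;
    st->pending_size = new_size;
    st->have_pending = 1;
    return 0;
}

/* Step 3: the update goes through mdmon if it is running, else directly. */
static void flush_update(struct st_sim *st)
{
    assert(st->have_pending);
    st->metadata_size = st->pending_size;
    st->have_pending = 0;
    if (st->use_update_tail)
        st->flushed_via_mdmon++;
    else
        st->synced_directly++;
}

static int grow_size(struct st_sim *st, unsigned long long new_size)
{
    st->use_update_tail = st->mdmon_running;    /* step 1 */
    if (reshape_super_size(st, new_size) != 0)  /* step 2 */
        return -1;
    flush_update(st);                           /* step 3 */
    st->kernel_size = new_size;                 /* step 4 */
    return 0;
}
```

In this model a rejected request returns before anything is flushed, so
neither the metadata nor the kernel size changes.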


2.3 Reshape level (simple-takeover)

"simple-takeover" implies the level change can be satisfied without touching
sync_action.

    1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
       initializes st->update_tail
    2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
       change is allowed (being performed at subarray scope) and prepares a
       metadata update
       2a/ raid10 --> raid0: degrade all mirror legs prior to calling
           ->reshape_super
    3/ mdadm::Grow_reshape(): flushes the metadata update (via
       flush_metadata_update(), or ->sync_metadata())
    4/ mdadm::Grow_reshape(): posts the new level to the kernel
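
As an illustration of step 2a, the helper below picks which slots to
keep when a raid10 is taken over as raid0.  It assumes a raid10 with two
"near" copies, so that slots (0,1), (2,3), ... hold identical data; that
assumption, and the name plan_raid10_to_raid0(), are not taken from the
text above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: raid10 with two "near" copies, so slots (0,1), (2,3), ...
 * hold identical data.  Keeps the even slot of each pair.
 * keep[i] is set to 1 for slots that survive and 0 for mirror legs to
 * degrade.  Returns the resulting raid0 disk count, or -1 if the
 * geometry does not fit the assumption or `keep` is too small.
 */
static int plan_raid10_to_raid0(int raid_disks, int *keep, size_t keep_len)
{
    int i;

    assert(keep != NULL);
    if (raid_disks < 2 || raid_disks % 2 != 0 ||
        (size_t)raid_disks > keep_len)
        return -1;
    for (i = 0; i < raid_disks; i++)
        keep[i] = (i % 2 == 0);
    return raid_disks / 2;
}
```

A 4-disk raid10 would keep slots 0 and 2 and become a 2-disk raid0 once
the handler has accepted the change.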

2.4 Reshape chunk, layout

2.5 Reshape raid disks (grow)

    1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
       because only redundant raid levels can modify the number of raid disks
    2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the change
       in the number of raid disks is allowed (being performed at proper scope /
       permissible geometry / proper spares available in the container) and
       prepares a metadata update.
    3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
       raid level that can perform the reshape and starts mdmon.
    4/ mdadm::Grow_reshape(): Pushes the update to mdmon...
       4a/ mdmon::process_update(): marks the array as reshaping
       4b/ mdmon::manage_member(): adds the spares (without assigning a slot)
    5/ mdadm::Grow_reshape(): Notes that mdmon has assigned spares and invokes
       ->manage_reshape()
    6/ mdadm::<format>->manage_reshape(): (for each subarray) sets sync_max to
       zero, starts the reshape, and pings mdmon
       6a/ mdmon::read_and_act(): notices that reshape has started and notifies
           the metadata handler to record the slots chosen by the kernel
    7/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
       the kernel to either the backup file or the metadata specific location,
       advances sync_max, waits for reshape, pings mdmon, and repeats.
       7a/ mdmon::read_and_act(): records checkpoints
    8/ mdadm::<format>->manage_reshape(): Once reshape completes, changes the
       raid level back to the nominal raid level (if necessary)

       FIXME: native metadata does not have the capability to record the original
       raid level in the reshape-restart case because the kernel always records the
       current raid level to the metadata, whereas external metadata can masquerade
       at an alternate level based on the reshape state.
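
The sync_max stepping performed by ->manage_reshape() in the steps above
can be sketched as a simulation.  struct sim, kernel_run() and
ping_mdmon() are hypothetical stand-ins: the "kernel" simply runs up to
sync_max, and "mdmon" records whatever position it sees when pinged.

```c
#include <assert.h>

/* Hypothetical model of one subarray being reshaped. */
struct sim {
    unsigned long long size;      /* total sectors to reshape */
    unsigned long long sync_max;  /* kernel may not reshape past this */
    unsigned long long position;  /* how far the kernel has reshaped */
    unsigned long long backed_up; /* end of the region saved by mdadm */
    unsigned long long recorded;  /* checkpoint recorded by mdmon */
    int pings;
};

/* Stand-in for the kernel: reshape proceeds until it reaches sync_max. */
static void kernel_run(struct sim *s)
{
    s->position = s->sync_max;
}

/* Stand-in for mdmon: record a checkpoint each time it is pinged. */
static void ping_mdmon(struct sim *s)
{
    s->recorded = s->position;
    s->pings++;
}

static void manage_reshape(struct sim *s, unsigned long long step)
{
    assert(step > 0);

    /* Start the reshape with sync_max at zero so nothing is overwritten. */
    s->sync_max = 0;
    s->position = 0;
    ping_mdmon(s);

    while (s->position < s->size) {
        unsigned long long next = s->position + step;

        if (next > s->size)
            next = s->size;
        s->backed_up = next;  /* save [position, next) before it is hit */
        s->sync_max = next;   /* only then let the kernel advance */
        kernel_run(s);        /* wait for the reshape to reach sync_max */
        assert(s->position <= s->backed_up);
        ping_mdmon(s);        /* let mdmon record the checkpoint */
    }
}
```

The invariant checked inside the loop is the one the text relies on: the
kernel never reshapes past data that mdadm has already saved.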

2.6 Reshape raid disks (shrink)

3 TODO

...

[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/