FR1 - the intelligent fast software RAID1 driver

FR1 - fast intelligent software RAID1 linux driver

Peter T. Breuer
December 2004

This is the fast software RAID1 (mirror) linux driver, "fr1", pronounced "ferrari"! It's intelligent. That is, it doesn't blindly resynchronize a whole mirror component when only a few blocks need resyncing. That can save hours of resync time on a large device.

What problem does fr1 solve? It is intended for the situation where network block devices make up some of the components of the mirror. See ENBD for such a networked block device, and the linux kernel NBD. In that setting, resyncs of the whole device cost you twice over: each one takes a long time, and they happen more often, because network brownouts are much more common than whole disk failures.

How does it work? The driver keeps a bitmap of pending writes in memory, and writes only the blocks marked there to the mirror component when it comes back on line. The bitmap is created pagewise on demand, so it's not expensive on memory. A terabyte sized device with blocks of 4K will cost at most 32MB of memory for the bitmap, and will probably never use as much as one percent of that maximum. The driver is tolerant of memory allocation failures - it'll still work if you run out of memory, just with less precision.
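The 32MB figure can be checked with a little shell arithmetic. This is only a sketch: it assumes the bitmap holds one bit per 4K block and that bitmap pages are 4K in size, neither of which is stated above, and the variable names are made up for the illustration.

```shell
# Assumptions (not taken from the driver source): one bitmap bit per
# device block, and bitmap pages of 4096 bytes.
device_bytes=$((1024 * 1024 * 1024 * 1024))   # 1 terabyte
block_bytes=4096                              # 4K blocks
page_bytes=4096                               # assumed bitmap page size

blocks=$((device_bytes / block_bytes))        # blocks on the device
bitmap_bytes=$((blocks / 8))                  # one bit per block
bitmap_mb=$((bitmap_bytes / 1024 / 1024))     # full bitmap size in MB
pages=$((bitmap_bytes / page_bytes))          # pages if fully populated
mb_per_page=$((page_bytes * 8 * block_bytes / 1024 / 1024))

echo "full bitmap: ${bitmap_mb}MB in ${pages} pages"
echo "each bitmap page covers ${mb_per_page}MB of the device"
```

Under those assumptions a bitmap page is only needed once a write lands in the stretch of device it covers, which would explain why the real cost is usually a small fraction of the maximum.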

What kernels is it for? The code was developed and tested for the 2.4 kernels, and has just been ported to the 2.6 kernels (fr1-2.15b). See the patches directory in the archive for details of the latter.

DOWNLOAD
HOW TO EXTRACT THE FILES FROM THE ARCHIVE
HOW TO COMPILE THE DRIVER
HOW TO USE IT
MAILING LIST

HOW TO EXTRACT THE FILES FROM THE ARCHIVE

Let's start with the basics. Do:

cd /tmp; tar xzvf fr1-2.0.tgz; cd fr1-2.0

or similar. Substitute /tmp by the directory where you plan on doing the compiling, and substitute "2.0" by the actual version number on the archive.

HOW TO COMPILE THE DRIVER

For the 2.4 kernels, edit the Makefile in the source directory and change LINUXDIR to point to the kernel source for your target kernel (that'll be /usr/local/src/linux-2.4.20, or something close to it). If you are compiling for an SMP machine, set SMPOPTS to "-D__SMP__", otherwise set it to "" (empty string).

For the 2.6 kernels, you will have to apply a patch from the patches directory in the archive. Change directory to /usr/local/src/linux-2.6.blah, and run (as root, via sudo or su)
patch -b -p1 < /tmp/fr1-2.0/patches/linux-2.6.blah.patch

Replace /tmp by the place you unarchived the source code to, and 2.0 by the actual version number. Then run "make oldconfig" and say "y" to FR1 support and either "y" or "m" to bitmap support.
For the 2.4 kernels, type "make" in the source directory and wait till cooked - you'll find the results of the cooking below the build/linux-blah.blah.blah/ subdir.

For the 2.6 kernels, type "make modules" in the kernel source directory root - you'll find the results in the kernel drivers/md/ subdirectory.
For the 2.4 kernels, put the freshly built fr1.o module in the misc/ subdirectory of your kernel modules in /lib/modules/blah.blah.blah/ and replace the kernel md.o module with the md.o module that just got made. Add the new bitmap.o module too.

For the 2.6 kernels, the modules you want are raid1.ko, md.ko and bitmap.ko.
Run /sbin/depmod -a if you are running under the target kernel right now.

HOW TO USE IT

First load the modules. Either insert them directly:

insmod md.o; insmod fr1.o

or let modprobe do it:

modprobe fr1

Next, you use raidtools (or mdadm). The following instructions are for raidtools.

Edit /etc/raidtab and put in an entry for a typical raid1 mirror device for /dev/md0. Here's an example:

raiddev /dev/md0
    raid-level               1
    nr-raid-disks            2
    nr-spare-disks           0
    persistent-superblock    0
    chunk-size               4
    device                   /dev/loop0
    raid-disk                0
    device                   /dev/loop1
    raid-disk                1

That was for a two-way mirror with two loop devices as components.
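If you have no loop devices set up yet, something like the following will make two small backing files to experiment with. It is a sketch only: the /tmp/fr1-demo directory and the file names are invented for the example, and the 1MB size is chosen to match the "1024 blocks" reported by /proc/mdstat in the examples in this document, on the assumption that those are 1K blocks.

```shell
# Create two 1MB backing files (1024 blocks of 1024 bytes each).
# The directory and file names are hypothetical; pick your own.
dir=/tmp/fr1-demo
mkdir -p "$dir"
for i in 0 1; do
    dd if=/dev/zero of="$dir/loop$i.img" bs=1024 count=1024 2>/dev/null
done
ls -l "$dir"

# Then, as root, attach the files to the loop devices named in
# /etc/raidtab (not run here because it needs root):
#   losetup /dev/loop0 /tmp/fr1-demo/loop0.img
#   losetup /dev/loop1 /tmp/fr1-demo/loop1.img
```

Detach them again with "losetup -d /dev/loop0" and "losetup -d /dev/loop1" when you have finished testing.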

Make the mirror in the usual way with the mkraid utility. For example:

mkraid --dangerous-no-resync --force /dev/md0

I don't see the point of NOT using --dangerous-no-resync. You can always do the sync a moment or two afterwards.

At this point you can "cat /proc/mdstat" and see how things look. Here is how they should look for the raidtab configuration detailed above.

Personalities : [raid1]
read_ahead 4 sectors
md0 : active raid1 [dev 07:00][0] [dev 07:01][1]
1024 blocks

You can now manipulate the mirror with the raidsetfaulty, raidhotremove, and raidhotadd tools. Raidstop and raidstart might also be useful.

The only difference with respect to normal usage is that a raidhotadd will WORK after a raidsetfaulty. You don't have to do a raidhotremove first. If you do the raidhotadd after a raidsetfaulty, then ONLY THE BLOCKS THAT THE FAULTED COMPONENT MISSED IN THE INTERVAL are resynced, that is, the blocks written to the array while the component was out of it. Not the whole device. So you want to do this!

For example, to fault one mirror component:

raidsetfaulty /dev/md0 /dev/loop1

After this, the output from /proc/mdstat will show a failed component. It won't be written to or read:

Personalities : [raid1]
read_ahead 4 sectors
md0 : active fr1 [dev 07:00][0] [dev 07:01][1](F)
1024 blocks
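If you want a script to notice the faulted state, you can look for the (F) marker. The helper below is a small sketch, not part of the fr1 distribution; the name has_failed is made up, and it only assumes the /proc/mdstat format shown in this document.

```shell
# has_failed: read /proc/mdstat-style text on standard input and
# succeed (exit status 0) if any component carries the (F) marker.
# Hypothetical helper, not shipped with fr1.
has_failed() {
    grep -q '(F)'
}

# Try it on the sample line from above.
sample='md0 : active fr1 [dev 07:00][0] [dev 07:01][1](F)'
if printf '%s\n' "$sample" | has_failed; then
    echo "a component is faulted"
fi
```

On a live system you would feed it the real thing, as in "has_failed < /proc/mdstat && echo faulted".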

Then to put the "failed" component back on line:

raidhotadd /dev/md0 /dev/loop1

and the situation will return to normal, immediately. Only a few dirtied blocks will have been written to the newly added device.

Personalities : [raid1]
read_ahead 4 sectors
md0 : active fr1 [dev 07:00][0] [dev 07:01][1]
1024 blocks

If you want to take the "failed" component fully offline, then you must follow the raidsetfaulty with a

raidhotremove /dev/md0 /dev/loop1

After this, you can still put the component back with raidhotadd, but the background resync will be total. You really want to avoid that.

Oh yes. You can now mkfs on the device, mount it, write files to it, etc. To stop (and deconfigure) the device, do

raidstop /dev/md0

No, I don't know what raidstart is supposed to do on a non-persistent array. It doesn't do anything on fr1.

If you fault one device, then write to the device, then hotadd the faulted device back in, you should be able to see from the kernel messages (use "dmesg") that the resync is intelligent. Here's some dmesg output:

raid1 resync starts on device 0 component 1 for 1024 blocks
raid1 resynced dirty blocks 0-9
raid1 resync skipped clean blocks 10-1023
raid1 resync terminates with 0 errs on device 0 component 1
raid1 hotadd component 7.1[1] to device 0

This resync only copied across blocks 0-9, and skipped the rest.
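The whole experiment can be strung together roughly as follows. Treat it as a sketch: the run wrapper and the DRY_RUN variable are inventions for this example, and by default the script only prints the commands. Set DRY_RUN=0 and run as root to really do it, and only on a scratch array, because the dd overwrites the start of /dev/md0. I make no promise that your kernel messages will match the ones above exactly.

```shell
# run: print the command when DRY_RUN is 1 (the default), otherwise
# execute it. Hypothetical wrapper for this example only.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run raidsetfaulty /dev/md0 /dev/loop1               # fault one component
run dd if=/dev/zero of=/dev/md0 bs=1024 count=10    # dirty a few blocks
run sync                                            # flush the writes out
run raidhotadd /dev/md0 /dev/loop1                  # put the component back
run dmesg                                           # look for the resync messages
```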

While the resync is happening, /proc/mdstat will show progress, like so:

Personalities : [raid1]
read_ahead 4 sectors
md0 : active fr1 [dev 07:00][0] [dev 07:01][1](F)
1024 blocks
[=======>.............] resync=35.5% (364/1024)
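A script can pick the percentage out of that progress line, for instance to wait until the resync is over. The function below is a sketch that assumes the "resync=NN.N%" format shown above; resync_percent is a made-up name, not a tool from the archive.

```shell
# resync_percent: read /proc/mdstat-style text on standard input and
# print the number after "resync=" (without the % sign), if present.
# Prints nothing when no resync is in progress.
resync_percent() {
    sed -n 's/.*resync=\([0-9.]*\)%.*/\1/p'
}

# Try it on the sample progress line from above.
printf '%s\n' '[=======>.............] resync=35.5% (364/1024)' | resync_percent
```

On a live system, "resync_percent < /proc/mdstat" in a loop with a sleep would do for polling.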

BUGS Etc.

I don't know how (or if) this works with mirrors with more than two components - my testing and development have never touched on that case. Let me know of problems and I'll fix them.
Ditto for arrays with spare components. The 2.4 kernel code is very obscure in the area of array management and I can be excused! The situation has improved in 2.6, but it is still not perfectly obvious code.
I really don't know what to do about mirror components that have a nonzero offset recorded for them in the array metadata. I've never seen it in the field. Probably things will go horribly wrong.

MAILING LIST

fr1-general@lists.sourceforge.net
You can subscribe via email. Send a message to Fr1-general-request@lists.sourceforge.net with the word `help' in the subject or body (don't include the quotes), and you will get back a message with the real instructions in!

AUTHOR

Peter T. Breuer (ptb@it.uc3m.es) December 2004.