Transforming an encrypted array from RAID 1 to RAID 6

May 16, 2016
linux, homelab, encryption

Introduction #

Some context #

I have a NAS at home which runs on Debian Jessie. A NAS wouldn’t be a NAS without some storage, so I put in two disks with 4 TB each when I built it. Those two disks had previously been used in an OpenMediaVault setup and already had a software RAID on them. When migrating the disks (to Ubuntu at first) I learned about mdadm and that OpenMediaVault uses it. Great, that was a rather painless transition!

Current situation #

In the meantime I migrated my system to Debian and put an encryption layer with LUKS on top of the RAID. Now I have bought two more drives and want to extend my capacity.

Output from lsblk currently looks kind of like this:

sdc                       8:32   0   3.7T  0 disk  
└─sdc1                    8:33   0   3.7T  0 part  
  └─md127                 9:127  0   3.7T  0 raid1 
    └─greens_crypt      254:3    0   3.7T  0 crypt /mnt/arr
sdd                       8:48   0   3.7T  0 disk  
└─sdd1                    8:49   0   3.7T  0 part  
sde                       8:64   0   3.7T  0 disk  
└─sde1                    8:65   0   3.7T  0 part  
  └─md127                 9:127  0   3.7T  0 raid1 
    └─greens_crypt      254:3    0   3.7T  0 crypt /mnt/arr
sdf                       8:80   0   3.7T  0 disk  
└─sdf1                    8:81   0   3.7T  0 part  

Never mind the ordering. I must have switched some cables when I put in the new drives. Mounting happens by UUID anyway.

We see a level 1 software RAID across the two partitions sdc1 and sde1. On top of that is a dm-crypt device in LUKS mode, which is formatted with ext4 and mounted at /mnt/arr.

Preparation #

Replicate the partitioning layout #

The new disks are sdd and sdf. First, I copied the partitioning layout from one of the old disks. blockdev reports the exact same size for all four disks, but it is still a good idea to create a partition that is slightly smaller than the disk, just in case you ever have to replace a drive with one that is a couple of megabytes short.

To do that, you can use the replication command of sgdisk:

• root ~ # SOURCE='/dev/sdc'; TARGETS='/dev/sdd /dev/sdf';
• root ~ # for target in $TARGETS; do
> sgdisk --replicate=$target $SOURCE
> sgdisk --randomize-guids $target
> done

That will copy the partition table from /dev/sdc to /dev/sdd and /dev/sdf and then randomize the GUIDs of the latter two, so the copies do not clash with the source disk.
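On the point about leaving some slack at the end of the disk: if you ever partition a disk from scratch instead of replicating a layout, a small helper like the following can work out a partition size. This is only a sketch. The function name safe_partition_mib, the 100 MiB margin, and the sample byte count are my assumptions for illustration, not values from the setup above.

```shell
# Print a partition size in MiB that leaves a safety margin at the end of
# the smallest disk.
# Arguments: margin in MiB, then one or more disk sizes in bytes
# (for example as printed by `blockdev --getsize64 /dev/sdX`).
safe_partition_mib() {
    local margin_mib=$1
    shift
    local smallest=$1 size
    for size in "$@"; do
        # Remember the smallest disk, since it limits the partition size.
        if [ "$size" -lt "$smallest" ]; then
            smallest=$size
        fi
    done
    # Round down to whole MiB, then subtract the margin.
    echo $(( smallest / 1024 / 1024 - margin_mib ))
}

# Assumed sample value: 4000787030016 bytes, a typical size for a 4 TB drive.
safe_partition_mib 100 4000787030016 4000787030016   # prints 3815347
```

On a real system you would feed it the output of blockdev --getsize64 for each disk that is going to be part of the array.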

How to extend the array? #

I searched the web for a while, looking for the best approach here ..

Options like BTRFS or ZFS?

They looked very interesting but would have made the encryption layer rather difficult.

Should I use RAID 5 and have ~ 12 terabytes of capacity?

No. Search for ‘RAID 5’ and ‘URE’ and bear in mind I have 4 TB disks here. You’ll find a reason pretty quickly. (Hint: with disks this large, hitting an unrecoverable read error during a rebuild is a real risk.)
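To put a rough number on the URE argument, here is a back-of-the-envelope sketch. It assumes a spec-sheet error rate of one unrecoverable read error per 10^14 bits read (a figure often quoted for consumer drives, not something I measured) and treats errors as independent, which is a simplification. The function name ure_probability is made up for this example.

```shell
# ure_probability DISKS_TO_READ DISK_TB [URE_EXPONENT]
# Estimates the chance of at least one unrecoverable read error while
# reading DISKS_TO_READ disks of DISK_TB terabytes each, assuming one
# error per 10^URE_EXPONENT bits (default 14) and independent errors.
ure_probability() {
    LC_ALL=C awk -v n="$1" -v tb="$2" -v e="${3:-14}" 'BEGIN {
        bits = n * tb * 1e12 * 8               # total bits that must be read
        printf "%.2f\n", 1 - exp(-bits / 10^e) # P(at least one error)
    }'
}

# Rebuilding a 4-disk RAID 5 means reading the 3 surviving 4 TB disks.
ure_probability 3 4      # prints 0.62
# Same rebuild with an assumed rate of one error per 10^15 bits.
ure_probability 3 4 15   # prints 0.09
```

Real drives often do better than their spec sheet, so take this as an illustration of why the rebuild window matters and not as a prediction.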

Create a new degraded RAID 6, copy files and then add the old disks?

That would require two re-synchronizations if you want to have a working array at all times, and it would still leave you in a state with two degraded arrays at one point. You would also need to unmount the old array and change all your entries in fstab, crypttab and so on.

Add the two disks as spares to the existing array and then convert it?

This is definitely the cleanest approach. It has been a function of mdadm itself since version 3.-something, and you can just keep using your array without worrying about inconsistent states between an old and a new array. It requires two reshaping operations, though, because it is not currently possible to go from RAID 1 to RAID 6 directly. So that’s what we’re going to do.
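To see what each step of that plan gives you in usable space, the usual rules of thumb can be sketched as a tiny helper. The function name raid_capacity_tb is hypothetical, and it assumes equally sized disks and ignores metadata overhead.

```shell
# raid_capacity_tb LEVEL DISKS SIZE_TB
# Usable capacity in TB for equally sized disks, ignoring metadata overhead.
raid_capacity_tb() {
    local level=$1 disks=$2 size_tb=$3
    case $level in
        1) echo "$size_tb" ;;                      # every disk is a mirror
        5) echo $(( (disks - 1) * size_tb )) ;;    # one disk worth of parity
        6) echo $(( (disks - 2) * size_tb )) ;;    # two disks worth of parity
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

raid_capacity_tb 1 2 4   # prints 4 (starting point)
raid_capacity_tb 5 3 4   # prints 8 (after the first reshape)
raid_capacity_tb 6 4 4   # prints 8 (after the second reshape)
raid_capacity_tb 5 4 4   # prints 12 (the rejected RAID 5 option)
```

So the capacity gain happens in the first reshape, and the second one only adds the extra parity disk.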

Let’s do it #

Add the new disks as spares #

If you replicated the partition layout properly this should be rather painless and instant:

• root ~ # mdadm /dev/md127 --add /dev/sdd1
mdadm: added /dev/sdd1
• root ~ # mdadm /dev/md127 --add /dev/sdf1
mdadm: added /dev/sdf1

If you look at your array with mdadm --detail /dev/md127 you should see the new drives added as spares at the end of the output.
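If you would rather check this in a script than by eye, something like the following can count the spares. It is a sketch that assumes the device table of mdadm --detail looks like the output shown in this post (state in the second-to-last column, device path in the last). The function name count_spares is my own.

```shell
# Count member devices whose state is "spare" in `mdadm --detail` output
# read from standard input.
count_spares() {
    awk '$NF ~ /^\/dev\// && $(NF-1) == "spare" { n++ } END { print n + 0 }'
}

# Intended use on a live system (needs root):
#   mdadm --detail /dev/md127 | count_spares

# Demonstration with two lines in the format mdadm prints:
printf '%s\n' \
    '       0       8       65        0      active sync   /dev/sde1' \
    '       3       8       49        -      spare   /dev/sdd1' | count_spares   # prints 1
```

Right after adding both new partitions you would expect it to report 2.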

Grow the RAID to level 5 #

Now we grow the RAID to level 5 across three disks, which starts a reshape:

• root ~ # mdadm /dev/md127 --grow --level=5 --raid-devices=3
mdadm: level of /dev/md127 changed to raid5

While the reshape is running, your mdadm --detail output should look similar to this:

• root ~ # mdadm --detail /dev/md127
        Version : 1.2
  Creation Time : Fri Nov 20 18:15:36 2015
     Raid Level : raid5
     Array Size : 3906885440 (3725.90 GiB 4000.65 GB)
  Used Dev Size : 3906885440 (3725.90 GiB 4000.65 GB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun May 22 16:02:44 2016
          State : clean, reshaping 
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Reshape Status : 7% complete
  Delta Devices : 1, (2->3)

           Name : fractal:greens
           UUID : e096[...]e57e
         Events : 52263

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       2       8       33        1      active sync   /dev/sdc1
       4       8       81        2      active sync   /dev/sdf1

       3       8       49        -      spare   /dev/sdd1

I performed the --grow operation this morning and it has been running since. If you look at /proc/mdstat you can get an idea of how long this takes:

• root ~ # cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md127 : active raid5 sdf1[4] sdd1[3](S) sde1[0] sdc1[2]
      3906885440 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [=>...................]  reshape =  7.2% (284918016/3906885440) finish=3292.1min speed=18336K/sec
      bitmap: 0/30 pages [0KB], 65536KB chunk

unused devices: <none>

That is a little more than two days remaining. It does get a little bit faster if it is not mounted and the LUKS device is not opened. But that’s the beauty of this: you can have the array mounted and in use while you do this. If you can live with slow performance, that is …
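If you check on the progress often, converting the minutes into days by hand gets old. Here is a small sketch that pulls the percentage and the finish estimate out of /proc/mdstat. The function name reshape_eta is made up, and it assumes the progress line has the same format as the one shown above.

```shell
# Read /proc/mdstat-style text on standard input and print the reshape
# progress together with the kernel's own finish estimate, in days.
reshape_eta() {
    sed -n 's/.*reshape = *\([0-9.]*\)%.*finish=\([0-9.]*\)min.*/\1 \2/p' |
    LC_ALL=C awk '{ printf "%.1f%% done, about %.1f days left\n", $1, $2 / 1440 }'
}

# Intended use on a live system:
#   reshape_eta < /proc/mdstat

# Demonstration with the progress line from this post:
printf '%s\n' \
    '      [=>...................]  reshape =  7.2% (284918016/3906885440) finish=3292.1min speed=18336K/sec' |
    reshape_eta   # prints: 7.2% done, about 2.3 days left
```

The estimate is just the kernel's finish value divided by 1440 minutes, so it moves around whenever the reshape speed changes.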

After this operation is finished you should already have more capacity. (Close to 8 TB in this case.)

Keep in mind, this system has an Intel Celeron J1900 (on a Supermicro X10SBA), which is great in terms of power efficiency but not exactly the fastest processor around. The drives are not the fastest either, so YMMV.

Grow the RAID to level 6 #

Once the first reshape had finished, almost three days later, I quickly verified that the array had grown to 8 TB and issued the command for the next reshaping operation:

• root ~ # mdadm /dev/md127 --grow --level=6 --raid-devices=4
mdadm: level of /dev/md127 changed to raid6

As noted above, the array can stay live during all this time, so I decided to resize the filesystem while the second reshape was running.

Resize the dm-crypt device #

As we are growing in size, we need to resize from the bottom up. The array already grew, so now we resize the dm-crypt device with this simple command:

• root ~ # cryptsetup resize /dev/mapper/greens_crypt

Replace the device name with your own, obviously.
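Since the order matters here, it can help to look at the size of each layer before and after every step. The following sketch prints the size of each block device you pass to it. The function name layer_sizes is hypothetical, and reading the size of a real device usually needs root.

```shell
# Print the size in bytes of every given block device, bottom layer first.
# Paths that are not block devices are reported instead of causing an error.
layer_sizes() {
    local dev
    for dev in "$@"; do
        if [ -b "$dev" ]; then
            printf '%s: %s bytes\n' "$dev" "$(blockdev --getsize64 "$dev")"
        else
            printf '%s: not a block device\n' "$dev"
        fi
    done
}

# Intended use for the stack in this post:
#   layer_sizes /dev/md127 /dev/mapper/greens_crypt
```

After cryptsetup resize, the dm-crypt device should be only slightly smaller than the array, with the difference being the LUKS header.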

Resize the filesystem #

Lastly we need to resize the filesystem that we have on top of our encrypted device. In my case that is a simple ext4 filesystem with no additional partitioning or LVM.

The steps of course vary for each filesystem and configuration. In the case of ext4, you check the filesystem first and then resize it. Unfortunately, you have to unmount the device to run e2fsck. (resize2fs can also grow a mounted ext4 filesystem if the kernel supports online resizing, but the steps here are the checked, offline route.)

# umount /mnt/arr
# e2fsck -f /dev/mapper/greens_crypt
# resize2fs /dev/mapper/greens_crypt
# mount /dev/mapper/greens_crypt /mnt/arr

The resize will take a while. It took about 15 minutes in my case with the reshaping operation already running in the background.

For a btrfs filesystem you actually have to do the resizing while the device is mounted:

# btrfs filesystem resize max /mnt/arr

Conclusion #

Right now, the second reshaping is in progress and it looks like it might take up to three days again.