Removing a Drive from a Btrfs Array

Metadata
  • Tags: #linux
  • Summary: If a drive is failing in an array, Btrfs can refuse to remove it because of corrupted files. A quick write-up of how to get the drive removed.

I recently ran into a situation where I knew one of the hard drives in my Btrfs array was failing. Simply following the Btrfs wiki instructions for removing devices didn’t work: the btrfs device delete command would throw an error, cancelling the device deletion.

At this point, I was okay with losing any corrupted/unrecoverable files. You should already be past the point of restoring from backup, attempting other recovery methods, etc. I just wanted to get the dying drive removed from the array since I knew I didn’t have critical data on the array.

Warning

Make sure you’ve exhausted all other resources before continuing. Check the ArchWiki, Btrfs mailing list, etc. This will lead to data loss!

Scrub the Array

Start a scrub to identify files blocking the device from getting deleted:

[cameron@host ~]$ sudo btrfs scrub start /mnt/btrfs

The scrub will take a while. In my case, it took nearly 27 hours:

[cameron@host ~]$ sudo btrfs scrub status /mnt/btrfs
UUID:             16df0a0e-1dad-439f-aee1-f9961122fe59
Scrub started:    Tue Jul  4 11:27:56 2023
Status:           finished
Duration:         26:55:47
Total to scrub:   31.20TiB
Rate:             338.01MiB/s
Error summary:    read=256
    Corrected:      238
    Uncorrectable:  18
    Unverified:     0
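
If you want to check the result from a script rather than by eye, here is a minimal sketch that pulls the uncorrectable count out of the status output. The function name scrub_uncorrectable is my own, and it assumes the output format shown above; other btrfs-progs versions may print the summary differently.

```shell
# Hypothetical helper (not part of btrfs-progs): read `btrfs scrub status`
# output on stdin and print the number after "Uncorrectable:".
# Prints 0 when that field is missing, e.g. when no errors were reported.
scrub_uncorrectable() {
    awk '$1 == "Uncorrectable:" { n = $2; found = 1 }
         END { print (found ? n : 0) }'
}
```

It could be used as: sudo btrfs scrub status /mnt/btrfs | scrub_uncorrectable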

Remove Uncorrectable Errors

Btrfs was able to correct many of the errors it came across, but in this instance I was left with 18 uncorrectable errors. As the scrub identifies these, it logs them to the kernel ring buffer (dmesg). I found it nearly impossible to search through that, since the messages had often fallen out of the buffer by the time I came back to the terminal. Instead, journalctl keeps enough history to search through:

[cameron@host ~]$ sudo journalctl --dmesg --grep 'unable to fixup'
Jul 04 13:56:48 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 32574079762432 on dev /dev/sdd physical 1251936043008
Jul 04 13:58:27 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 32615643807744 on dev /dev/sdd physical 1265582800896
Jul 04 14:15:53 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092168253440 on dev /dev/sdd physical 1424279666688
Jul 04 14:16:08 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092168646656 on dev /dev/sdd physical 1424280059904
Jul 04 14:16:08 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092168515584 on dev /dev/sdd physical 1424279928832
Jul 04 14:16:23 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092169498624 on dev /dev/sdd physical 1424280911872
Jul 04 14:16:23 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092169760768 on dev /dev/sdd physical 1424281174016
Jul 04 14:16:30 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092169891840 on dev /dev/sdd physical 1424281305088
Jul 04 14:16:45 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092171005952 on dev /dev/sdd physical 1424282419200
Jul 04 14:16:45 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092171137024 on dev /dev/sdd physical 1424282550272
Jul 04 14:16:53 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092172382208 on dev /dev/sdd physical 1424283795456
Jul 04 14:17:00 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092201283584 on dev /dev/sdd physical 1424312696832
Jul 04 14:17:07 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092202528768 on dev /dev/sdd physical 1424313942016
Jul 04 14:17:19 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092203773952 on dev /dev/sdd physical 1424315187200
Jul 04 14:17:19 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092203773952 on dev /dev/sdd physical 1424315187200
Jul 04 14:17:26 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092205019136 on dev /dev/sdd physical 1424316432384
Jul 04 14:17:38 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092209999872 on dev /dev/sdd physical 1424321413120
Jul 04 14:17:38 host kernel: BTRFS error (device sdh): unable to fixup (regular) error at logical 33092209999872 on dev /dev/sdd physical 1424321413120
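
Some addresses show up more than once in that log (18 lines, but only 16 distinct logical addresses), so it can help to reduce the output to a unique list before going further. A rough sketch, assuming the kernel message format shown above; the function name extract_logicals is made up for this example.

```shell
# Hypothetical helper: read kernel log lines on stdin and print each
# distinct logical address from "unable to fixup" messages once,
# in ascending numeric order.
extract_logicals() {
    sed -n 's/.*unable to fixup.* logical \([0-9][0-9]*\) on dev .*/\1/p' |
        sort -n -u
}
```

For example: sudo journalctl --dmesg --grep 'unable to fixup' | extract_logicals > logicals.txt (logicals.txt is just a placeholder file name).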

Now we need to translate those logical addresses into files we can actually attempt to remove:

[cameron@host ~]$ sudo btrfs inspect-internal logical-resolve 32574079762432 /mnt/btrfs
/mnt/btrfs/P6300090.mov

The output shows logical 32574079762432 refers to /mnt/btrfs/P6300090.mov. Remove the file, then do the same for every other logical address reported as an uncorrectable error. Several of the addresses I looked up no longer resolved to a file. I just removed the files that were identified and ignored the addresses that didn’t resolve to anything.
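
Resolving each address by hand gets tedious, so here is a sketch of a loop that does it for a whole list. It is not what I ran at the time. The function name resolve_logicals and the BTRFS_CMD variable are my own inventions, and it assumes logical-resolve exits with an error for addresses that no longer map to a file. It only prints paths and deliberately deletes nothing, so the list can be reviewed first.

```shell
# Assumption: BTRFS_CMD is a hypothetical override hook for this sketch,
# not a btrfs feature. It defaults to running btrfs through sudo.
BTRFS_CMD=${BTRFS_CMD:-sudo btrfs}

# Read logical addresses on stdin (one per line) and print the distinct
# file paths they resolve to under the mount point given as $1.
resolve_logicals() {
    mount_point=$1
    while read -r logical; do
        # Addresses that do not resolve are skipped; their error output is
        # discarded. stdin is redirected so the command cannot consume the
        # address list the loop is reading.
        $BTRFS_CMD inspect-internal logical-resolve "$logical" "$mount_point" \
            < /dev/null 2> /dev/null || true
    done | sort -u
}
```

For example, with one address per line in a placeholder file logicals.txt: resolve_logicals /mnt/btrfs < logicals.txt. Check the printed paths before removing anything.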

Delete the Device

Now remount the Btrfs array, marking it as degraded. Then, delete the device you want to remove (in my case /dev/sdd):

[cameron@host ~]$ sudo mount -o degraded /dev/sdd /mnt/btrfs
[cameron@host ~]$ sudo btrfs device delete /dev/sdd /mnt/btrfs

The device deletion will also take a while, but you should now see the used total on the device you’re deleting shrink over time:

[cameron@host ~]$ sudo btrfs fi show
Label: none  uuid: 16df0a0e-1dad-439f-aee1-f9961122fe59
        Total devices 7 FS bytes used 31.16TiB
        devid    1 size 14.55TiB used 13.96TiB path /dev/sdh
        devid    2 size 0.00B used 962.01GiB path /dev/sdd
        devid    3 size 12.73TiB used 12.14TiB path /dev/sda
        devid    4 size 12.73TiB used 1.05TiB path /dev/sdb
        devid    5 size 12.73TiB used 1.05TiB path /dev/sdc
        devid    6 size 12.73TiB used 1.05TiB path /dev/sdf
        devid    7 size 12.73TiB used 1.05TiB path /dev/sdg

And here’s the same output a while later:

[cameron@host ~]$ sudo btrfs fi show
Label: none  uuid: 16df0a0e-1dad-439f-aee1-f9961122fe59
        Total devices 7 FS bytes used 31.16TiB
        devid    1 size 14.55TiB used 13.96TiB path /dev/sdh
        devid    2 size 0.00B used 874.01GiB path /dev/sdd
        devid    3 size 12.73TiB used 12.14TiB path /dev/sda
        devid    4 size 12.73TiB used 1.07TiB path /dev/sdb
        devid    5 size 12.73TiB used 1.07TiB path /dev/sdc
        devid    6 size 12.73TiB used 1.07TiB path /dev/sdf
        devid    7 size 12.73TiB used 1.07TiB path /dev/sdg
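
To keep an eye on just the device being removed, the used figure can be picked out of that output. A small sketch, assuming the devid line layout shown above; device_used is a name I made up for this example.

```shell
# Hypothetical helper: read `btrfs filesystem show` output on stdin and
# print the value after "used" for the device whose path is given as $1.
device_used() {
    awk -v dev="$1" '$1 == "devid" && $NF == dev {
        for (i = 1; i < NF; i++)
            if ($i == "used") print $(i + 1)
    }'
}
```

For example: sudo btrfs fi show /mnt/btrfs | device_used /dev/sdd. Running that every so often should show the number dropping.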

Let that complete, and the device should now be removed from the Btrfs array. You’re now missing several files that were unrecoverable, so hopefully those were backed up somewhere else or they weren’t critical.