Intro
I recently added some hard drives to my home media server, and have finally decided to add some redundancy there. I've long resisted using RAID, because it only protects against one very specific form of data loss (drive failure), while ignoring things like file corruption, or just user error. On the other hand, mirroring data to another drive is such a waste of disk space that I could never bring myself to do it, especially for stuff I could ultimately redownload if I needed to. I'll explain my solution, although if you've read the title you may already have guessed what it is. First though, I'd like to go over what RAID is and why I don't like it.
What is RAID?
If you know what RAID is, you can skip over this section. Alternatively you can read this better overview.
RAID stands for Redundant Array of Inexpensive Disks. The idea is that you buy cheaper drives, which may be more likely to fail, and add some redundancy between them, so that if any do fail you don't lose data.
There are many types of RAID setups, each identified by a number, and we'll go over a few of them. All RAID setups share some traits; for example, all your drives are combined to look like one large drive. Four criteria vary between the RAID types and are worth weighing when choosing a setup:
- How many drives can I lose before I lose data?
- Space efficiency (how many drives are wasted on redundancy?)
- Read performance
- Write performance
RAID 1
RAID 1 is the simplest: just 2 drives that each hold a copy of your data. In RAID 1, if you have two 12 TB drives, it looks like you have a single 12 TB drive, but there is a real-time copy of it on the second drive. If you have more than 2 drives the same concept applies, with half the drives devoted to mirroring the other half (strictly speaking, mirrored pairs combined like this are a form of RAID 10, which comes up below).
In RAID 1 you can lose up to half your drives before you lose data. With more than 2 drives, though, that depends on which drives fail, since they are paired up. If you had a 6 drive RAID 1 setup, you could lose data with 2 drive failures, or be fine with 3 failures, depending on which drives fail (but you'd always be safe with 1 drive failure).
Lastly, the read and write performance of RAID 1 is interesting. While the write speed isn't affected much, the read speed is roughly doubled. This is because half of each file can be read from each drive. Which leads us to our next RAID level...
RAID 0
RAID 0 is similar to RAID 1, except only half of each file is stored on each drive. This means that no single drive has any full file. The benefit of this is that you get roughly double read and write speed, since you are only reading and writing half the file on any single drive. RAID 0 also doesn't lose any drives to redundancy. If you have two 12 TB drives, your RAID 0 system will have 24 TB available.
If you haven't figured it out already, there is one downside to RAID 0. You will lose all your data if you lose a single drive. RAID 0 has no redundancy (hence why it's RAID "0"), and in fact, it's worse than not using any RAID, because a single drive failure will destroy all your data on both drives, rather than just one.
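To make the "half of each file on each drive" idea concrete, here is a minimal Python sketch of striping. The function names and the 4-byte chunk size are made up for illustration; real RAID 0 works on much larger blocks at the device level, not on Python byte strings.

```python
def stripe(data: bytes, n_drives: int = 2, chunk: int = 4) -> list[bytes]:
    """Deal fixed-size chunks of data across the drives in round-robin order."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return [bytes(d) for d in drives]


def unstripe(drives: list[bytes], chunk: int = 4) -> bytes:
    """Reassemble the data by reading chunks back in the same order."""
    out = bytearray()
    offsets = [0] * len(drives)
    total = sum(len(d) for d in drives)
    i = 0
    while len(out) < total:
        d = i % len(drives)
        out += drives[d][offsets[d]:offsets[d] + chunk]
        offsets[d] += chunk
        i += 1
    return bytes(out)


data = b"ABCDEFGHIJKLMNOP"
d1, d2 = stripe(data)
print(d1, d2)  # b'ABCDIJKL' b'EFGHMNOP'
assert unstripe([d1, d2]) == data
```

Neither drive holds a complete copy of the file, which is why losing either one loses everything.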
RAID 0 isn't typically used by itself, but is often combined with RAID 1 to form various permutations of RAID 10. RAID 10 is just 4 drives in some combination of RAID 1 and 0. I'm going to skip over giving more details because this background section has already grown much too long.
RAID 5
This is where RAID gets interesting. In RAID 5 you have 3 or more drives, and you lose exactly 1 of those to parity data. We haven't mentioned "parity" yet, but the idea is that we can calculate a checksum based on the combined data on the rest of the drives, and that checksum is enough to recover the data if one of those drives fails.
With RAID 5 you lose one drive to redundancy, which means it becomes more efficient as you get more drives (with 2 drives it would be the same efficiency as RAID 1, while having worse performance). The read performance of RAID 5 is the same as no RAID, but the write speed takes a hit because the parity needs to be calculated in real time for everything you write to the disk.
Another thing to consider is that you can safely handle a single drive failure, but as the number of drives you have goes up, the odds of a second drive failing before you are able to recover the first drive go up too. Therefore, it's a balancing act of how many drives you "waste" on parity vs how many you have for data.
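Here is a rough Python sketch of why the odds grow with the array size. It assumes every surviving drive fails independently with the same probability during the rebuild. That is a simplification, and the 1% figure is a made-up placeholder, not a measured failure rate.

```python
def second_failure_probability(n_drives: int, p_fail_during_rebuild: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild finishes.

    ASSUMPTION: failures are independent and equally likely for every drive.
    """
    survivors = n_drives - 1
    return 1 - (1 - p_fail_during_rebuild) ** survivors


# Hypothetical 1% chance per drive; only the trend matters here.
for n in (3, 6, 12):
    print(n, "drives:", round(second_failure_probability(n, 0.01), 4))
```

Whatever per-drive number you plug in, the result only goes up as you add drives.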
RAID 6
RAID 6 is just RAID 5, but with two parity drives instead of one. Everything I said about RAID 5 applies, just with two drives for redundancy vs one. This makes more sense as you get more drives, but the trade-off is losing more usable space. The key thing here is that you can lose the same number of drives as you have parity drives. It doesn't matter if the lost drives are data drives or parity, or any combination of them. If you have 2 parity drives, and you lose 2 drives, you will be able to recover.
This page is a good summary of the RAID levels, including some I didn't talk about.
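The space and redundancy numbers from the sections above can be summed up in a small Python sketch. It assumes equal-size drives, and it treats RAID 1 the way this post describes it (mirrored pairs); `raid_summary` is a name made up for this example.

```python
def raid_summary(level: str, n_drives: int, size_tb: float) -> tuple[float, int]:
    """Return (usable TB, number of drive failures that is always survivable)."""
    if level == "0":
        return n_drives * size_tb, 0          # no redundancy at all
    if level == "1":
        # Mirrored pairs, as described above: half the drives hold copies.
        return (n_drives // 2) * size_tb, 1
    if level == "5":
        return (n_drives - 1) * size_tb, 1    # one drive's worth of parity
    if level == "6":
        return (n_drives - 2) * size_tb, 2    # two drives' worth of parity
    raise ValueError(f"level {level!r} is not covered by this sketch")


for level, n in (("0", 2), ("1", 2), ("5", 4), ("6", 6)):
    usable, safe = raid_summary(level, n, 12)
    print(f"RAID {level} with {n} x 12 TB: {usable} TB usable, survives {safe} failure(s)")
```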
What is parity?
I want to explain what I mean by "parity" when discussing redundancy, mainly because I think it's a pretty cool concept. Again, feel free to skip this section if you understand what parity is.
In its simplest form, parity is just combining the bits on the drive using the XOR function. XOR stands for exclusive OR. I'm going to restrain myself from explaining XOR in depth, and just say for our purposes, XOR asks the question "Are there an odd number of 1 bits?". So if you are looking at 3 bits, and they are 1 0 1, then the answer to that question is no (there are two 1s, and two is even). Therefore XOR(1 0 1) = 0 and then XOR(0 0 1) = 1 because there is one 1, and one is odd.
Now here's the really cool thing about XOR. If you take the output of XOR and store it along with the inputs, you can remove any one of those inputs and the XOR on the remaining inputs along with the output will equal the missing input. For example XOR(1 0 1) = 0, if we lose the first bit there, and instead have X 0 1, we can just take the XOR of what we have left, plus the output (which was 0) to get XOR(0 1 0) = 1. I know that is confusing, but as another example: XOR(1 1 1) = 1 (three 1s, three is odd), so we store 1 1 1 1 (the last bit is the parity bit that we got from the XOR), and then if we lose any of those 1s, we can calculate XOR of what remains and we know XOR(1 1 1) = 1. Another quick example:
XOR(1 0 0) = 1
Store that combined as 1 0 0 1
Then lose one bit: 1 X 0 1
Calculate XOR of what remains: XOR(1 0 1) = 0
Replace the lost bit above with the 0 we just got: 1 X 0 1 -> 1 0 0 1
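The same walk-through as a minimal Python sketch (the `parity` helper is just a name chosen for this example):

```python
from functools import reduce
from operator import xor


def parity(bits):
    """XOR all the bits together: 1 if there is an odd number of 1s, else 0."""
    return reduce(xor, bits, 0)


data = [1, 0, 0]
stored = data + [parity(data)]        # [1, 0, 0, 1], last bit is the parity bit

lost_index = 1                        # pretend the second bit is gone: 1 X 0 1
survivors = [b for i, b in enumerate(stored) if i != lost_index]
recovered = parity(survivors)         # XOR(1 0 1) = 0

assert recovered == stored[lost_index]
print(stored, "-> recovered bit:", recovered)
```

Changing `lost_index` to any other position, including the parity bit itself, recovers that bit in the same way.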
I made a spreadsheet to show this with full 8 bit bytes. This is representing 3 drives (d1 through d3), each with 8 bits of data on them (lettered A through H), and then a parity drive (p1) with the calculated XOR of each column. If you look at any column and count how many 1s there are (excluding the final row, which is parity), then the final row should have a 1 if there were an odd number of 1s and a 0 if there were an even number of 1s. So in the below image, column A has two 1s, so the parity bit is 0. Column C has one 1, and so the parity bit is 1.
The first table in the image shows the three data drives and the parity drive all full. The middle table shows d2 having failed (highlighted in red) and been swapped for a new, empty drive. The final table shows d2 after it was rebuilt by calculating the XOR of the other white rows. In the first and last table the gray row is the result of XORing the 3 white rows.
This scales up to any number of drives, and we always only lose a single drive to parity. Here we have 7 data drives and only 1 parity drive, and still can lose any drive and then recover.
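As a sketch of how this scales, here is the same idea in Python with 7 data drives and whole bytes instead of single bits. The drive contents are arbitrary made-up values and `xor_blocks` is a hypothetical helper, not part of any RAID tool.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Seven "drives", each holding two arbitrary bytes.
drives = [bytes([i * 37 % 256, i * 91 % 256]) for i in range(1, 8)]
parity = xor_blocks(drives)           # the single parity drive

lost = 3                              # any one drive can fail
survivors = [d for i, d in enumerate(drives) if i != lost]
rebuilt = xor_blocks(survivors + [parity])

assert rebuilt == drives[lost]
```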
You may now be asking how RAID 6 works with 2 simultaneous parity drives, where you can lose any 2 drives and then recover. The answer is the math becomes much more complex, and I can't explain it, but if you want to read through it, this post does a good job of explaining it.
Why don't I like RAID?
We've covered a lot of ground for me to be able to answer this question. First off there are some practical concerns I haven't really touched on. RAID is annoying to set up, and requires identically sized drives. This is fine for a data center that is going to have many servers each running RAID, where the drives only need to match in a single server, but for the home user, whose storage is going to grow organically, it's very annoying to have to replace all your drives every time you want to increase your storage.
The bigger problem with RAID though, is that it's not a backup. I mentioned this above, but RAID protects you against your drive failing, and nothing else. It does not protect against software corruption, user error, or your house burning down. If you delete a file from a RAID server, and then realize that was a mistake, it's too late, it and the redundant copy are gone. If a file gets corrupted when you save it, the corruption is also instantly copied to the parity.
What about Unraid?
Unraid is a paid operating system, which aims to make RAID much more user friendly. Despite its name, Unraid really is just RAID, but with a lot of the annoying parts eliminated. It handles the problem of having a mix of drive sizes, and allows adding new drives to an existing array.
I strongly considered Unraid for my home server. The main reasons I didn't go with it though, are:
- It isn't cheap. It's either $50/year or $250 for a lifetime license.
- It requires starting with empty drives. While you can add a new blank drive to an existing Unraid array, if you are starting out with full drives you will have to buy enough empty drives to start a new empty array, and then move stuff onto them. This also means all the configuration I have set up on my server (Home Assistant, Plex, etc) would be lost, or at least would require me to migrate it all over, and deal with the downtime while I did that.
- Finally, Unraid is not any more of a backup than RAID is. As far as parity goes, Unraid is the same as RAID 5, 6, etc depending on how many parity drives you use. But if you corrupt a file, you still have no backup of it.
Snapraid to the rescue
I discovered something called Snapraid, and instantly knew it was the best choice for me. Snapraid is an open source command line tool which calculates parity for any number of drives and stores it as a single parity file on another drive.
Snapraid has one main catch, which is both a pro and a con: it only runs on demand, meaning that when you make changes, they aren't added to the parity until you actually run the snapraid command to recalculate it. This, however, means it serves as a sort-of-backup. Things like corrupted files or mistakenly deleted files can be recovered, as long as you discover the problem before the next parity recalculation.
Snapraid works best where you have a large amount of media that rarely changes. This is my exact use case, so it works very well for me.
The drawbacks of Snapraid are:
- It doesn't combine drives into one filesystem (although it is often combined with mergerfs, which does exactly that).
- It's a command line tool (although really pretty easy to use).
- Updates only happen on demand, which means anything you write will not be protected until you recalculate the parity.
- Unintuitively, updating a file will cause other files to be unprotected until you rerun the parity command. I'll explain this one more later, since it's the biggest catch.
The benefits of Snapraid are:
- It's a free open source command line tool. I realize I had command line as a drawback above, but it will be a pro or a con depending on your point of view.
- It can be added to an existing system, very easily. All you need is enough space for the parity file, which means you need one empty drive, which is at least as large as your largest other drive.
- You can use any mix of drives, and add and remove them from the array easily.
- You can use multiple parity drives (Snapraid supports up to six), to cover whatever level of multiple-drive-failure risk you're OK with. Their FAQ includes a good guideline on how many parity drives you should use.
- There is no performance overhead when you write (or read) files. It's not running at all aside from whenever you schedule it to run. It only takes a few minutes to run after the first time.
- It has features to help protect against random errors that can happen in RAM, which is the reason people often use ECC memory in RAID servers. Namely, it can run the parity calculations twice for each file, and it has a "scrub" feature that double checks the existing parity data for a percentage of your total data. The scrub will also detect corruption caused by a failing HDD.
- The fact that it only runs on demand means you can recover from mistakes and corruption as long as you notice before the next run.
I mentioned Snapraid is pretty easy to use. You set up a .conf file, telling it what your data drives are, and which drive to store parity on. You can also exclude files or folders and tell it to ignore them, which is a good option for frequently changing data (although obviously consider some other backup strategy for that data). Once it's set up, you run it with snapraid sync, and your setup could be as simple as running that via a cron job every night.
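For a rough idea of what that looks like, here is a minimal example config. Every path and drive name below is a placeholder rather than my actual setup. The content lines aren't mentioned above: they tell Snapraid where to keep its list of files and checksums. Check the Snapraid manual for the exact options your version supports.

```text
# /etc/snapraid.conf (all paths are placeholders)

# Where the parity file lives: a drive at least as large as the largest data drive
parity /mnt/parity1/snapraid.parity

# Where Snapraid keeps its file list and checksums (more than one copy is wise)
content /var/snapraid/snapraid.content
content /mnt/disk1/snapraid.content

# The data drives to protect
data d1 /mnt/disk1/
data d2 /mnt/disk2/

# Things to leave out of the parity
exclude *.tmp
exclude /frequently-changing/
```

A nightly run could then be a crontab entry along the lines of `0 3 * * * snapraid sync`, with the time and the path to the binary adjusted to your system.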
There is a nice bash script that makes running Snapraid much easier. You can read through the readme, but some main features of it are an email report every time it runs, as well as a lot of configuration on when Snapraid should not run. For example, if it detects a large number of files have changed or been deleted, you can have it email you a warning and not run. This would give you a chance to recover those files if it turned out they had been deleted by accident.
I'd recommend checking out that script if you do use Snapraid. From what I can see, the contributors to it are all really cool people.
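To show the kind of guard described above, here is a small Python sketch. It is not that script's actual logic. It assumes `snapraid diff` ends with summary lines such as `      3 removed`, so verify that against your version's output, and note that the thresholds are made-up numbers.

```python
import re
import subprocess


def count_changes(diff_output: str) -> dict[str, int]:
    """Pull summary counters out of `snapraid diff` output.

    ASSUMPTION: the summary contains lines like '       3 removed'.
    """
    pattern = r"^[ \t]*(\d+)[ \t]+(added|removed|updated|moved|copied)[ \t]*$"
    return {m.group(2): int(m.group(1))
            for m in re.finditer(pattern, diff_output, re.MULTILINE)}


def safe_to_sync(counts: dict[str, int], max_removed: int = 50, max_updated: int = 500) -> bool:
    """Refuse to sync when suspiciously many files were removed or updated."""
    return (counts.get("removed", 0) <= max_removed
            and counts.get("updated", 0) <= max_updated)


def guarded_sync() -> bool:
    """Run `snapraid sync` only if the pending changes look reasonable."""
    # No check=True here: the diff command may exit non-zero when it finds differences.
    diff = subprocess.run(["snapraid", "diff"], capture_output=True, text=True)
    if not safe_to_sync(count_changes(diff.stdout)):
        print("Too many changes; skipping sync so the old parity can still be used.")
        return False
    subprocess.run(["snapraid", "sync"], check=True)
    return True
```

Skipping the sync is the whole point: as long as the parity is not recalculated, files deleted by accident can still be recovered.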
Biggest gotcha to Snapraid
I mentioned this above, but it's worth reiterating, as it's the biggest potential problem with Snapraid. If you either delete or modify a file, that file will of course not be protected until you run the snapraid sync command again. However, other random files on your other drives will also lose their protection until you run the command again. To understand why, go up and review those parity diagrams from above. By modifying a file on drive 1, you are changing the parity calculation for the same spot on drive 2, drive 3, etc. Modify the bit in column A of d1, and you can no longer recover the bits in column A of d2 or d3. So, if you delete a 1 MB file on drive 1, you now have a 1 MB hole on drive 2, 3, etc.
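Here is a minimal Python sketch of that hole, using the same XOR parity as earlier with three tiny made-up "drives". It only illustrates the principle and doesn't reflect how Snapraid lays out its parity internally.

```python
def xor_bytes(*blocks):
    """XOR equal-length byte blocks together, position by position."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


d1, d2, d3 = b"\x0f\xa0", b"\x33\x55", b"\xc1\x02"
parity = xor_bytes(d1, d2, d3)        # parity as of the last sync

d1_modified = b"\xff\xa0"             # first byte of d1 changes, no new sync yet

# Now d2 fails and we rebuild it from the survivors plus the stale parity.
rebuilt_d2 = xor_bytes(d1_modified, d3, parity)

assert rebuilt_d2[1] == d2[1]         # the untouched position still recovers
assert rebuilt_d2[0] != d2[0]         # the position lined up with the change does not

# If the old contents of d1 are still around, recovery works again.
assert xor_bytes(d1, d3, parity) == d2
```

The last assertion is why keeping the old file around until the next sync helps.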
There are two ways to mitigate this. First, don't delete and modify things often. And when you know things are going to be modified and deleted, schedule that to happen shortly before the sync command is scheduled to run (but not so shortly before that it's still running when the sync command starts). Second, when you need to delete something, you can move it to a Snapraid-excluded folder. Then the file still exists, and if a drive fails you can move it back to its prior location and be able to recover everything. Once you run the sync command, Snapraid will calculate a new parity without the excluded files, and you can delete them whenever you want.
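The second mitigation could be sketched in Python like this. The function names and folder layout are hypothetical, and the holding folder is assumed to be listed as an exclude in your Snapraid config.

```python
import shutil
from pathlib import Path


def soft_delete(file_path: Path, data_root: Path, holding_dir: Path) -> Path:
    """Move a file into the excluded holding folder, keeping its relative path."""
    target = holding_dir / file_path.relative_to(data_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file_path), str(target))
    return target


def restore(held_path: Path, data_root: Path, holding_dir: Path) -> Path:
    """Put a held file back where it was, e.g. before recovering a failed drive."""
    target = data_root / held_path.relative_to(holding_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(held_path), str(target))
    return target
```

Keeping the relative path inside the holding folder means the file can go back to exactly its prior location if a drive fails before the next sync.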
Odds and ends
This, predictably, grew to be quite a large post. Still, there are some random additional thoughts I wanted to include, so here they are.
I've written before how I'm backing up my documents to AWS S3 Glacier. This is still my strategy for the things I care the most about. It gives me daily snapshots and is quite cheap. My AWS bill is currently $0.25/month, and I have years of backups I haven't deleted.
I also have a bunch of systems that I want to be able to recover if their root drive fails. My strategy for these is to mirror some of the folders nightly to a HDD in the server. I exclude that folder from Snapraid, since its contents are already a mirror of their primary locations. I'm using rsync to sync a few important folders from my desktop and the server itself. I also have a bunch of Pis that I want to be able to restore if their SD card fails. I wrote before about doing that with rsync, but I'm now using this script to make a full image of the entire SD card to an NFS drive. That image can be mounted and examined. It works pretty well, although it's not super user friendly. I also found this script, which might be better, but I haven't tried it.
I discovered the site healthchecks.io, which I'm a really big fan of. It helps you keep track of all your backup scripts to detect if any start to silently fail. You can create up to 20 checks for free, and for each you get a unique URL, which you then hit as the final step of your backup script and it'll register as a run. Then you can configure the site with how often those scripts should run and it'll alert you if they miss a check-in. I have a slightly more complex pattern I've been following, which pings the site at the start and end of the script, so it can monitor how long the scripts run for, and I'll be alerted right away if they fail. You can see an example of that in my backup script here.
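The start/end pattern could look something like this minimal Python sketch. It isn't my backup script. It assumes the check URL accepts `/start` and `/fail` suffixes (confirm that in the healthchecks.io docs), and the URL in the usage note is a placeholder.

```python
import urllib.request


def ping(url: str, timeout: float = 10) -> None:
    """Hit a URL and ignore network errors so monitoring never breaks the backup."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
    except OSError:
        pass


def run_monitored(check_url: str, job, send=ping) -> None:
    """Signal start, run the job, then signal success or failure."""
    send(check_url + "/start")        # lets the site time how long the job runs
    try:
        job()
    except Exception:
        send(check_url + "/fail")     # alert right away instead of waiting for a missed check-in
        raise
    send(check_url)                   # a plain ping registers a successful run
```

Usage would be something like `run_monitored("https://hc-ping.com/your-check-id", my_backup_function)`, where `my_backup_function` is whatever does the actual work.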
Finally, I used the site serverpartdeals.com to buy my additional drives. They are pretty well regarded around the internet, and they have good prices on recertified drives. I've never used recertified drives before, but I figure if I'm going to have them protected with parity anyway, I might as well buy more, cheaper, drives and just increase parity if I feel that is too risky.