Thursday, July 9, 2020

A Graphical Analysis of Women's Tops Sold on Goodwill's Website


I set up a script that collected information on listings for more than four million women's shirts for sale through Goodwill's website, going back to mid-2014. The information is deeply flawed—a Goodwill online auction is very different from a Goodwill store—but we can get an idea of how thrift store offerings have changed through the years. There's more info on the data collection method below.

Wednesday, July 1, 2020

Using AWS S3 Glacier Deep Archive For Personal Backups

I've been using AWS S3 for personal backups, and it's working well.  The hardest part of doing anything in AWS is that you have no idea what it will cost until you actually do it; they are masters of nickel-and-dime charging.  With that in mind, I wanted to wait until I had a few months of solid data before reporting on how it's been working for me.

If you know me, this may surprise you, but my backup strategy is a bit complex.  However, the relevant part for this post is that my documents folder is about 16 GB and I'm keeping a full backup of that, with daily diffs, for about $0.02 a month.

Costs

I did a post estimating the costs last year, and the actual charges have lined up with that estimate.

Here is the relevant part of my AWS bill for May 2020 (June looks to be the same, but isn't complete yet):

There are also some regular S3 line items, since I believe the file list is stored there even when the files are in Deep Archive.  However, I'm far below the cost thresholds there.

Process

I have a local documents folder on my SSD that gets backed up to a network version nightly via an rsync script.  Folders that are no longer being updated (e.g., my school folder) I delete from my local version and just keep on the network version.
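
A minimal sketch of that kind of nightly sync, with placeholder paths rather than my real ones:

#!/bin/bash
# Nightly sync of the local documents folder to the network version.
# No --delete flag, so folders removed from the local copy stay on the network copy.
rsync --archive --verbose /home/me/documents/ /mnt/network/documents/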

Every month I create a full zip of my local documents folder and upload it to S3.  Then every day I create a zip of just the files that have changed in the last 40 days.  I chose 40 days to provide some overlap.  You could be more clever and just grab the files that have changed since the first of the month, but I wanted to keep the process simple, given how important it is.  I also do a yearly backup of the full network version of this folder, which has a lot of stuff in it that hasn't changed in years.

The result is that I could do a full recovery by pulling the most recent monthly backup and then the most recent daily backup, and replacing the files in the monthly with the newer versions from the daily.  I'd also have to pull the most recent yearly backup and extract that to a separate location.

This feels like a pretty simple recovery, all things considered.
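
One wrinkle to remember for recovery: Deep Archive objects have to be restored before they can be downloaded, and a Standard-tier restore can take up to about 12 hours.  Roughly, with placeholder bucket and key names:

# Ask S3 to keep a restored copy of the object available for 7 days.
aws s3api restore-object \
    --bucket my-backup-bucket \
    --key documents/full/documents-full-2020-06-01.7z \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Check the "Restore" field to see whether the restore has finished.
aws s3api head-object --bucket my-backup-bucket --key documents/full/documents-full-2020-06-01.7z

# Once restored, download it like a normal object.
aws s3 cp s3://my-backup-bucket/documents/full/documents-full-2020-06-01.7z .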

Scripts

The full backup:
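
What follows is a stripped-down sketch of the idea rather than my exact script; the paths, bucket name, size threshold, and email address are placeholders.

#!/bin/bash
# Monthly full backup: zip the entire local documents folder and send it to Deep Archive.
set -e

SOURCE=/home/me/documents
WORKDIR=/tmp/backup
ARCHIVE="$WORKDIR/documents-full-$(date +%Y-%m-%d).7z"
PASSWORD_FILE=/home/me/.backup-password
BUCKET=s3://my-backup-bucket/documents/full/
MAX_MB=20000   # warn if the archive comes out unexpectedly large
EMAIL=me@example.com

mkdir -p "$WORKDIR"

# Encrypted 7z archive; -mhe=on also encrypts the file names.
7z a -t7z -mhe=on -p"$(cat "$PASSWORD_FILE")" "$ARCHIVE" "$SOURCE"

# Warn if the archive is bigger than expected, since storage costs money.
SIZE_MB=$(( $(stat -c%s "$ARCHIVE") / 1024 / 1024 ))
if [ "$SIZE_MB" -gt "$MAX_MB" ]; then
    echo "Full backup is ${SIZE_MB} MB, expected under ${MAX_MB} MB" | mail -s "Backup size warning" "$EMAIL"
fi

# Upload straight into the Deep Archive storage class.
aws s3 cp "$ARCHIVE" "$BUCKET" --storage-class DEEP_ARCHIVE

rm "$ARCHIVE"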

And the diff backup:
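
Again, a sketch with placeholder names; the 40-day window shows up in the find command.

#!/bin/bash
# Daily diff backup: archive only files modified in the last 40 days and send it to Deep Archive.
set -e

SOURCE=/home/me/documents
WORKDIR=/tmp/backup
ARCHIVE="$WORKDIR/documents-diff-$(date +%Y-%m-%d).7z"
PASSWORD_FILE=/home/me/.backup-password
BUCKET=s3://my-backup-bucket/documents/diff/
MAX_MB=2000
EMAIL=me@example.com

mkdir -p "$WORKDIR"

# List everything modified in the last 40 days, relative to the documents folder.
cd "$SOURCE"
find . -type f -mtime -40 > "$WORKDIR/changed-files.txt"

# Encrypted archive built from the list of changed files.
7z a -t7z -mhe=on -p"$(cat "$PASSWORD_FILE")" "$ARCHIVE" @"$WORKDIR/changed-files.txt"

# Same size warning as the full backup, with a smaller threshold.
SIZE_MB=$(( $(stat -c%s "$ARCHIVE") / 1024 / 1024 ))
if [ "$SIZE_MB" -gt "$MAX_MB" ]; then
    echo "Diff backup is ${SIZE_MB} MB, expected under ${MAX_MB} MB" | mail -s "Backup size warning" "$EMAIL"
fi

aws s3 cp "$ARCHIVE" "$BUCKET" --storage-class DEEP_ARCHIVE

rm "$ARCHIVE" "$WORKDIR/changed-files.txt"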


If you want to adapt these scripts it should be pretty straightforward.  You'll need 7-Zip installed and the AWS command line client set up.  Create a nice long random password and store it in the password file.  Make sure you have a system for retrieving that password if you lose everything.
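
For the password, something along these lines works (the path is a placeholder):

# Generate a long random password and make the file readable only by you.
openssl rand -base64 48 > /home/me/.backup-password
chmod 600 /home/me/.backup-password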

There's a feature to warn if the compressed file is larger than expected, since that will cost money.  The numbers are arbitrary and work for me; you'd have to adjust them.  Also, if you want the emailed warnings, you'll have to set up mail and change the email address.

If you do want to use S3 Deep Archive for backups I really recommend reading my previous post, because there are a lot of caveats.  I highly encourage you to combine your files into a single archive, because that will reduce the per-file costs dramatically.

Also, note there is nothing here to delete these backups.  If all you care about is being able to restore the current version, then you can delete all but the newest versions.  Keeping them all gives you the ability to restore to any point in time.  If you do delete them, keep in mind that Deep Archive has a minimum storage duration, so there's a limit to how quickly you can delete things without paying extra.
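
If you did want old diffs to clean themselves up, an S3 lifecycle rule is the usual tool; a hypothetical rule that expires diffs after a year might look like this (bucket name and prefix are placeholders):

# Write a lifecycle rule that expires objects under the diff prefix after 365 days.
cat > /tmp/lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-diffs",
      "Filter": { "Prefix": "documents/diff/" },
      "Status": "Enabled",
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket my-backup-bucket \
    --lifecycle-configuration file:///tmp/lifecycle.json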

Epilogue

I realize there are easier, freer, and arguably better solutions out there for personal backups.  That's it; I don't have a 'but.'  If you're reading this blog, this should not be a surprise.  Now that I have real data, I'm thinking about backing up some of my harder-to-find media here too.  I estimate 1 TB should cost about $12 per year in any of the cheapest regions.