Thursday, July 9, 2020

A Graphical Analysis of Women's Tops Sold on Goodwill's Website


I set up a script that collected information on listings for more than four million women's shirts for sale through Goodwill's website, going back to mid-2014. The information is deeply flawed (a Goodwill online auction is very different from a Goodwill store), but we can get an idea of how thrift store offerings have changed through the years. There's more info on the data collection method below.

Wednesday, July 1, 2020

Using AWS S3 Glacier Deep Archive For Personal Backups

I've been using AWS S3 for personal backups, and it's working well.  The hardest part of doing anything in AWS is that you have no idea what it will cost until you actually do it; they are masters of nickel-and-dime charging.  With that in mind, I wanted to wait until I had a few months of solid data before reporting on how it's been working for me.

If you know me, this may surprise you, but my backup strategy is a bit complex.  However, the relevant part for this post is that my documents folder is about 16 GB and I'm keeping a full backup of that, with daily diffs, for about $0.02 a month.

Costs

I did a post estimating the costs last year, and the actual bills have lined up with that estimate.

Here is the relevant part of my AWS bill for May 2020 (June looks to be the same, but isn't complete yet):

There are also some regular S3 line items, since I believe the file list is stored there even when the files are in Deep Archive.  However, I'm far below the cost thresholds there.

Process

I have a local documents folder on my SSD that gets backed up to a network version nightly via an rsync script.  Folders that are no longer being updated (e.g., my school folder) get deleted from the local version and kept only on the network version.
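The sync itself is nothing fancy; a minimal sketch (the paths are placeholders for my actual setup) is just:

  #!/bin/bash
  # Nightly sync of the local documents folder to the network copy.
  # Note there's no --delete: folders removed from the local copy
  # (like that old school folder) stay on the network version.
  rsync -av /home/me/Documents/ /mnt/nas/Documents/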

Every month I create a full zip of my local documents folder and upload it to S3.  Then every day I create a zip of just the files that have changed in the last 40 days.  I chose 40 days to provide some overlap.  You could be more clever and just grab files that changed since the first of the month, but I wanted to keep the process simple given how important it is.  I also do a yearly backup of the full network version of this folder, which contains a lot of stuff that hasn't changed in years.

The result is that I could do a full recovery by pulling the most recent monthly backup and then the most recent daily backup, and replacing the files in the monthly with the newer versions from the daily.  I'd also have to pull the most recent yearly, and extract that to a separate location.

This feels like a pretty simple recovery, all things considered.
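The one wrinkle is that objects in Deep Archive have to be restored before you can download them, which can take anywhere from about 12 to 48 hours depending on the retrieval tier.  Roughly, pulling a backup back down looks like this (bucket and key names are placeholders):

  # Ask S3 to thaw the archived object (this is the slow part).
  aws s3api restore-object \
    --bucket my-backup-bucket \
    --key full/docs-full-2020-05-01.7z \
    --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}'

  # Once the restore finishes, download and extract it.
  aws s3 cp s3://my-backup-bucket/full/docs-full-2020-05-01.7z .
  7z x docs-full-2020-05-01.7z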

Scripts

The full backup:
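Stripped down to the essentials (paths, bucket name, and the password file location are placeholders you'd change for your own setup):

  #!/bin/bash
  # Monthly full backup: zip the whole documents folder with 7zip, encrypted
  # with a long random password stored in a file, then push it to S3 with the
  # Deep Archive storage class.
  DOCS=/home/me/Documents
  BUCKET=s3://my-backup-bucket
  STAMP=$(date +%Y-%m-%d)
  ARCHIVE=/tmp/docs-full-$STAMP.7z

  7z a -mhe=on -p"$(cat /home/me/.backup_password)" "$ARCHIVE" "$DOCS"

  aws s3 cp "$ARCHIVE" "$BUCKET/full/docs-full-$STAMP.7z" --storage-class DEEP_ARCHIVE
  rm "$ARCHIVE"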

And the diff backup:
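Same idea, but it only grabs files modified in the last 40 days, and it checks the archive size before uploading (the size threshold and email address are placeholders):

  #!/bin/bash
  # Daily diff backup: zip only files modified in the last 40 days and upload it.
  DOCS=/home/me/Documents
  BUCKET=s3://my-backup-bucket
  STAMP=$(date +%Y-%m-%d)
  ARCHIVE=/tmp/docs-diff-$STAMP.7z

  cd "$DOCS" || exit 1
  find . -type f -mtime -40 -print0 |
    xargs -0 7z a -mhe=on -p"$(cat /home/me/.backup_password)" "$ARCHIVE"

  # Warn if the archive is suspiciously large, since uploads and storage cost money.
  # 500 MB is an arbitrary threshold; adjust it for your own data.
  SIZE=$(stat -c%s "$ARCHIVE")
  if [ "$SIZE" -gt 500000000 ]; then
    echo "Diff backup is $SIZE bytes" | mail -s "Backup size warning" me@example.com
  fi

  aws s3 cp "$ARCHIVE" "$BUCKET/diff/docs-diff-$STAMP.7z" --storage-class DEEP_ARCHIVE
  rm "$ARCHIVE"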


If you want to adapt these scripts it should be pretty straightforward.  You'll need 7zip installed and the AWS command-line client set up.  Create a nice long random password and store it in the password file.  Make sure you have a system for retrieving that password if you lose everything.

There's a feature to warn if the compressed file is larger than expected, since that will cost money.  The numbers are arbitrary and work for me; you'd have to adjust them.  Also, if you want the emailed warnings, you'll have to set up mail and change the email address.

If you do want to use S3 Deep Archive for backups I really recommend reading my previous post, because there are a lot of caveats.  I highly encourage you to combine your files into a single file, because that will reduce the per-file costs dramatically.

Also, note there is nothing here to delete these backups.  If all you care about is being able to restore the current version, then you can delete all but the newest versions.  Keeping them all gives you the ability to restore from any point in time.  If you do delete them, keep in mind there is a limit to how fast you can delete things from Deep Archive (objects have a minimum storage duration, so deleting them early still incurs charges).

Epilogue

I realize there are easier, freer, and arguably better solutions out there for personal backups.  That's it; I don't have a 'but.'  If you're reading this blog, this should not be a surprise.  Now that I have real data, I'm thinking about backing up some of my harder-to-find media here too.  I estimate 1 TB should cost about $12 per year in any of the cheapest regions.

Saturday, April 4, 2020

Stateless Password Managers

An idea I've had for a while is a password generator where you take a master password, an optional per-site password, and the site's domain name, then combine and hash them to get a unique password for each site.
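In its simplest form that's only a few lines.  This is just a toy sketch of the idea; real implementations (like the ones linked below) use proper key-derivation functions and smarter output encoding rather than a bare hash:

  #!/bin/bash
  # Toy sketch: derive a site password from a master password, an optional
  # per-site password, and the domain.  Illustrative only.
  read -r -s -p "Master password: " MASTER; echo
  read -r -s -p "Per-site password (optional): " EXTRA; echo
  DOMAIN="example.com"

  HASH=$(printf '%s' "$MASTER$EXTRA$DOMAIN" | sha256sum | cut -d' ' -f1)
  # Truncate and tack on a suffix so it meets typical length/character rules.
  echo "Password for $DOMAIN: ${HASH:0:16}Aa1!"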

This system has a unique benefit over traditional password managers in that you can't lose your passwords.  Even if all your electronics were destroyed and you woke up naked in China tomorrow, you could regenerate your passwords just by using an online version of the tool (or, failing that, manually doing the steps yourself with a hash generator).

However, the system has a unique drawback: it can't remember each site's password requirements.  Some sites require special characters, some don't allow them, some require more than 10 characters, some allow a max of 8.  It would be easy to translate your hash into whatever set of requirements a site has, but you still need to either remember those requirements or store them somewhere else.

Today I discovered this idea has been implemented, a lot.  It's called a stateless password manager, or a deterministic password manager.  Two examples are:

https://masterpassword.app/

https://lesspass.com/#/

And here is an article discussing the flaws in this system:
https://tonyarcieri.com/4-fatal-flaws-in-deterministic-password-managers

Tuesday, March 24, 2020

Social Distancing Scoreboard

According to the World Health Organization and the CDC, social distancing is currently the most effective way to slow the spread of COVID-19. We created this interactive Scoreboard, updated daily, to empower organizations to measure and understand the efficacy of social distancing initiatives at the local level.


https://www.unacast.com/covid19/social-distancing-scoreboard

Sunday, March 15, 2020

How do laser distance measures work?

I recently bought a laser tape measure; it's pretty great.  One button turns it on, and then it gives you instant distance measurements to wherever you point the laser.  There are more expensive ones that work at longer distances, but the one I got was $30 and goes up to 65 feet.  I compared it to a normal tape measure and it was accurate and repeatable to an eighth of an inch.  I was pretty impressed with it, and it was a great toy to add to my collection of measuring devices.

However, I began to wonder how it worked, especially since it worked so well, and was so cheap.

How laser distance measures don't work

In principle it would be simple.  Light has a very well known speed, so all you have to do is measure how long it takes for the light to go out and reflect back.  Distance = speed x time (halved, since the light makes a round trip).  You could encode a binary number in the laser, just a counter incrementing and resetting when it runs out of numbers.  Measure what number is being reflected back and how long ago you sent that number out, and you know how long it took to come back.



However, the devil is in the details, and measuring that time precisely enough to resolve 1/8th of an inch is going to be hard.

A 1/8th of an inch is 3.175 mm.  The speed of light is 299,792,458 m/s, or 299,792,458,000 mm/s.  3.175 mm / 299,792,458,000 mm/s = 1.059066002254133e-11 seconds, which is about 10.59 picoseconds.  Take the inverse of that and it's 94.42 gigahertz.  I'm going to go out on a limb and assume that the $30 laser tape measure I have in my pocket doesn't have a 100 GHz clock inside of it.

How do they actually work?

Instead of transmitting a counter, just send an alternating pulse.  It doesn't have to be very fast; a MHz would be enough.  Then your reflected pulse is the same wave, but delayed slightly.  You only care about measuring the difference in timing between the rising and falling edges of the two waves, or the delta.  This means you can just compare the two waves using an XOR gate, which is just a fancy way of saying "tell me whenever these waves are different".

Here's an example:


The top red line is the original signal, the second blue line is the reflected version, and the third green line is the XORed delta of the two.

When you measure something slightly further away, the reflected wave gets more delayed and the delta version gets a longer pulse.


Are logic gates fast enough? 

Logic gates like these are cheaper and faster than the circuitry you'd need for a timer.  However, they still aren't quite fast enough for the precision we see in these tools.  Luckily, gate delay doesn't really impact the measurement, as long as it's a consistent delay on both the rising and falling edges of the two waves.


All you end up with is a slightly offset delta signal.

Who will measure the measurer?

It might seem like we're back to square one here, with the need to precisely measure the time of that pulse, but we actually just need to take the average of that signal.  There are a variety of ways we can do this, but as a proof of concept, imagine the delta signal is charging a capacitor, which is simultaneously being drained by a constant resistor.  You'd end up with a level of charge in the capacitor that translates into what percentage of time the delta signal is high.
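To make that last step concrete, here's a toy conversion from duty cycle back to distance.  The 1 MHz modulation and 1% duty cycle are made-up numbers, and it assumes the round-trip delay is less than half a period:

  #!/bin/bash
  # The XOR output is high for twice the round-trip delay each period, so:
  #   duty = 2 * (2*d/c) / T   which gives   d = duty * c / (4 * f)
  C=299792458      # speed of light, m/s
  FREQ=1000000     # modulation frequency, Hz (assumed)
  DUTY=0.01        # measured fraction of time the XOR output is high
  echo "scale=4; $DUTY * $C / (4 * $FREQ)" | bc -l
  # prints .7494 -- about 75 cm for a 1% duty cycle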

Now, all you have to do is measure the charge in the capacitor and turn that into a measurement you display.  Let's review what we need:
  • Laser transmitter and optical sensor.
  • MHz clock to turn laser on and off.
  • XOR circuit to compare the transmitted and received signals.
  • A capacitor and resistor circuit to find the average of the digital signal.
  • A way to measure the charge in the capacitor.
  • Something to take that measurement and convert it into the distance.
  • A display.
None of this is very expensive.  I'm pretty amazed they can combine them for less than $30, but at that point, you'd be losing money not to buy one.

Saturday, February 29, 2020

Guessing Smart Phone PINs by Monitoring the Accelerometer

https://www.schneier.com/blog/archives/2013/02/guessing_smart.html
In controlled settings, our prediction model can on average classify the PIN entered 43% of the time and pattern 73% of the time within 5 attempts when selecting from a test set of 50 PINs and 50 patterns. In uncontrolled settings, while users are walking, our model can still classify 20% of the PINs and 40% of the patterns within 5 attempts.