Wednesday, December 29, 2010

Storing Passwords Securely

Recently, Mozilla has admitted to accidentally leaking password info for 44,000 accounts.  A few weeks ago, Gawker had accidentally leaked password info for a million+ accounts.  These incidents have spawned a lot of discussion about what was done wrong and what changes should have been implemented.  I wrote something about picking good passwords before, so I'll just address how to securely store passwords.

I won't attempt to address specifics for how a website should store its password database, or how they can prevent leaks.  That is a matter of internal policy, and isn't relevant to what I want to discuss here.  I'll begin with the assumption that passwords are stored in a database and that this database will be leaked at some point.  The question is what practices should be put into place to mitigate damage when that leak happens?

The answer is actually quite simple, and easy to implement.  When the user creates a password (with no practical limits on length or characters) this password should be combined with a per-user unique salt, and then a cryptographically secure hash (e.g. SHA-2bcrypt) should be taken and stored with whatever token the system uses to ID users (e.g. username).  This process can be repeated (i.e. feeding the hash result back into the hash function) many times if one wishes to increase computational time needed for an attack.  There are many ready-made functions to do just this, in any language a site is likely to be using for its backend database work.  Thus, implementing this process is simple.  Indeed, it's likely that using a ready-made and secure function is going to be easier than reinventing the wheel and coming up with a custom in-house function that ignores decades of cryptographic research.

So what does all that above mean?  Why not just store the passwords in a database directly?  After all, if the database leaks at all, it is likely to be a very bad thing.  Shouldn't the effort be made in ensuring the database leak doesn't occur in the first place?  Certainly leaks should be prevented, but security is all about assuming compromises and then creating a system that is robust enough to survive them.  Many independent layers should work together to provide overall security.  In many discussions of secure communications there is the assumption that all communications are intercepted.  There are methods to have secure conversations without eavesdropping, but it is still wise to operate from a premise that you can't foresee and avoid every attack.  By making your system secure even if it is intercepted you provide your communications with an extra layer of security.  For the same reason, it is wise to store passwords securely in a database, even while working to prevent that database from leaking.

If you're now in agreement that passwords should be stored securely in a database (which itself should in-turn be stored securely), your next question should be what exactly does the process outlined above mean, and how does it actually provide security?  When a password is stored directly in a database, so that it may be retrieved by the system, it is said to be stored in plaintext.  The answer is not to simply encrypt the passwords, at least not in the traditional sense.  The problem is that the master key used to encrypt the passwords then has to be stored in plaintext.  If the database is leaked, it is safe to assume any key used to encrypt entries in it will also be leaked.

You may have noticed that many websites, when you forget your password, will reset your password to something random and email it to you, as opposed to emailing you the original password you chose.  The reason is that these sites don't know your password at all.  Indeed, a surefire sign of poor security is if, when you recover a password, the site sends the original as opposed to a new random password.  Even with a metaphorical gun to its head, a secure website could never reveal your password, simply because it doesn't know it.  This appears to lead to a paradox, as clearly the website is able to authenticate users.  How can a website know that the password a user enters is correct without itself knowing the password?  The answer is one way functions, and specifically cryptographic hash functions.

Hash functions are rather simple concepts.  One can provide any data, without a lower or upper limit on size, and the hash will output a short hash value of a constant size.  Some key properties of cryptographic hash functions are:

Constant Output Size - The hash always returns a value of a certain size, regardless of input size.  For example, the MD5 hash of 'Monkey' is '4126964b770eb1e2f3ac01ff7f4fd942'; the 591 MB slackware-13.0-install-d1.iso file has an MD5 hash of '6ef533fcae494af16b3ddda7f1c3c6b7'.  Both are 128 bit numbers expressed as 32 hex values.  This is a key fact.  It also shows why limits on password size are silly, and usually a good sign the site is using some insecure method of storing passwords.  No matter how long the user makes their password the size of the value that needs to be stored is the same.

One Way - The next thing is that they aren't reversible.  Given a hash there is no way to figure out what data created that hash, short of brute-force trying of every possible combination of data until the same hash is found.  Even then, if the input data is random there is no way of knowing for sure that the correct input has been found.  It should be obvious that if 591 MB can be reduced to a 16 byte number, then clearly there must be multiple inputs that give the same output.  While the number of outputs possible with 128 bits is quite large, it is considerable smaller than the infinite number of inputs possible with any arbitrary length.

Collision Resistant - When two different inputs create the same output it is known as a collision.   As seen above, collisions are unavoidable.  However, in order for a hash function to be considered cryptographically secure they must be unpredictable.  In other words, given an output hash value, if there is any method for finding a collision that is easier than just trying possibilities until one is found, then the hash is considered flawed.  How much easier it is to find collisions than a brute-force will determine how seriously flawed the hash is.  Attacks on the common MD5 hash have found collisions with complexity of 291.  This has the practical effect of reducing the security of the MD5 hash from 128 bit to 91 bit.

Avalanche Effect - Another very important factor for cryptographic hash functions is that very small changes must lead to very large changes in the hash.  As I stated above, the MD5 hash for 'Monkey' is '4126964b770eb1e2f3ac01ff7f4fd942'.  On the other hand, the MD5 hash for (going from capital 'M' to lower case 'm') 'monkey' is 'd0763edaa9d9bd2a9516280e9044d885'.  Notice that there is a single byte change, and that made a dramatic change in the output.  Given the three hashes calculated here, and told the three inputs they were derived from, but not which matched with which, it would be impossible to determine which two had nearly the same input.  There are just under five billion bits in the above mentioned Linux iso.  Flipping just a single arbitrary bit from 0 to 1 or vice versa would result in totally different output.  I'm too lazy to demonstrate, but suffice it to imagine a random 128 bit number.

In the parlance of cryptography this is called the avalanche effect.  It is the reason hashes are often seen when downloading a file.  By checking of hash of the file you receive to the one the website lists you can be sure there were no transmission errors.

Benefits of Hashes:
Now that we've covered hashes, let's review what benefits using them provides.  When the user sets their password the system finds the hash of that password and stores that.  Since hashes are fast to compute and output a short number there is little overhead in using them.  When the user attempts to log in they enter their password and the system computes the hash of it, compares that hash to the one on record and if they match the user is given access.  When an attacker gains access to the database all they have is the hashes of the passwords.  As we learned above there is no way to reverse the hash to find the passwords.  Since the system takes the password input from the user and calculates the hash itself, this means that even with the hashes an attacker can't gain access.  If he submits the hash the system will calculate the hash of that, which will not match the hash on file.  The attacker must find the input that gave the hash, and the only way to do that is a brute-force.  In addition, this provides some security to users that use the same password in multiple places.  Since the attacker still doesn't know the actual password, he can't use it to gain access anywhere.

If you were following the Gawker leak, you know that lists of the most common passwords were spread online.  Gawker did not store the passwords in plaintext, they only stored the hashes.  How then, were the plaintext versions found?  The answer shows why simply using hashes (as Gawker did) isn't enough.  In addition to only storing the hash one must add salt.

Before explaining salt, first I'll answer how the Gawker passwords were found from their hashes.  There are a handful of cryptographically secure hashes in widespread use.  MD5 is probably the most common, but there are also the newer SHA-1 and SHA-2 (aka SHA-256/512).  Since there are only a handful of hashes used by everyone a new attack arose.  Attackers can take a dictionary of common password, or even a compressive list of all possible combinations within certain specifications (e.g. all alphanumeric combinations from 1 to 8 characters) and precompute the hash for all of them.  This process may take a long time; however, once it's done the results can be used over and over again and even shared.  The resultant database of possible passwords and their hashes is known as a rainbow table, and is indeed, very common.  An attacker with a rainbow table and a database from a compromised web site can quickly find all the passwords whose hashes have been precomputed in his rainbow table.

The answer to this attack is to salt the passwords before storing them.  By adding some data to each password before the hash is calculated one ensures that the outputted hashes won't be in any premade rainbow tables.  This data is known as the salt.  While simply adding a random piece of data to every password one can ensure that any rainbow table must be specifically computed for this one specific website, an even better practice is to use a piece of data that is unique to every user.  For example, a username, email, or UID would be ideal choices to add to the password as a salt.  Even though those pieces of info aren't secret they will make premade rainbow tables useless.  In addition even a custom rainbow table for an entire web site, which could be worth the effort, would be useless.  The only options an attacker has are to either brute-force each password individually, or attempt to precompute rainbow tables containing password+salt combinations.

In a well known example, early (1970s) Unix implementations used an only 2 byte salt.  This created 4096 possibilities for each password.  It therefore, increased the rainbow table size needed to cover any set password space by a factor 4096.  While in the 70s this was adequate, by the 2000s storing the full rainbow table with salts became practical, and any *nix versions still using the old password system had to be updated.

Salts must be stored in plaintext, since the system needs to use them to find the hashes.  As such, they provide no extra security to a user using a weak password from a brute-force standpoint.  However, if large enough, they all but eliminate the possibility of rainbow table attacks.  Since there is little downside to using longer salts, a modern secure system might use something like this: MD5(user_password + UID + 256_bit_pseudorandom_salt) with the 256 bit salt being a random number created by the system for each user when they create a password and stored, in plaintext, in the database along with their UID and hash of password.

With the combination of hashes and long salts a website ensures that even if their database is leaked the only way for an attacker to get access to passwords is with a brute-force attack.  Which, if the user uses a sufficiently long password, should be prohibitively time consuming, to the point of being impossible.

Additional Security:
Additional difficulty in brute-forcing can be created by recursively running the hash through itself.  If a hash takes one millionth of a second to run on a modern system, then running the hash a million times over and over for each password will increase the time taken to one second.  While providing extra security, this only scales linearly as opposed to exponentially (as is the case with increasing hash or salt size).  Since it creates an identical slowdown for both the attacker and the website this is viewed as a limited layer of security.

To review, a proper modern system might look like this:
$userpass; //the user's inputted password
$uid; //unique id for each user used as database key
$salt; //unique and random 256 bit salt created for each user
$hash; //the hash of the user's pass + salt stored in the database

//user registration:
input $userpass; //get password from user
input $uid; //generate $uid or get from user
$salt = ran(0, 2^256); //create a random 256 bit salt
$hash = bcrypt($userpass + $salt); //find bcrypt hash of password+salt
//store $uid, $salt, and $hash in database, don't store $userpass

//user login:
input $userpass; //get password from user
input $uid; //get user's uid
//find $salt and $hash from database using $uid as a key
if $hash == bcrypt($userpass + $salt)
//if hashes match access granted
//if hashes don't match access denied

In the time since writing this I've been swayed on the effectiveness of massive GPU parallelization in generating hashes fast enough to brute force passwords with salts. This doesn't change the overview of password storage and why salts are needed, but it does change my recommendation to use SHA-512 and my aversion to recursively using the hash to slow down the process.  A thousand dollars worth of GPUs can generate 250 million SHA-512 hashes per second (or 20 billion MD5 hashes).  Modern hashes intended for password storage are designed to be harder to massively parallelize.

With that in mind, I'll revise my advice.  Instead of SHA-512, one should use bcrypt (or similar) with the number of repetitions set high enough so that hashing takes at least 0.1 seconds on modern hardware.

Also, I was well aware that MD5 was hopelessly broken when I wrote this, but I didn't make that clear at all in the post.

1 comment:

  1. You covered it all, Dale. What a thorough explanation of all the components of a safe login scheme. Hashing, salts, and multiple passes - the bare minimum required for safety. You really explained it well.