This is going to be a post with a pretty limited audience, even by this blog's standards. I'll try to keep it short, so let me get to the point I want to make: if you're generating a random code to represent something (like a94a8fe5), don't just use hexadecimal. Do the extra work to use the full alphanumeric space (a-z and 0-9) instead (e.g., 3z4xlz).
As for why, you may notice that the second code (3z4xlz) is shorter than the first (a94a8fe5). Despite this, the chance that you randomly generate two identical codes by accident (a 'collision') is about the same for both. Technically a collision is about twice as likely for the 6 character code as for the 8 character hex one (36^6 is about 2.2 billion possible values, versus about 4.3 billion for 16^8), but we're looking at orders of magnitude of probability here, and if you went to 7 characters (36^7 is about 78 billion) the chance would be far lower than for the 8 character hex one.
If you're not a software developer, this might seem like a weird point to be making, but hexadecimal is very common in programming, and as a result it's very easy to output. When a programmer needs a random code, they will often just generate a string of hexadecimal characters and truncate it to however many they need. Converting to the full alphanumeric space takes extra work. It's not a lot of effort, but most don't bother, most likely because they don't think to do it.
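To show how little extra work it is, here is a minimal Python sketch. The names `ALPHABET`, `random_code`, and `hex_to_base36` are made up for illustration and are not from any particular library. The first function draws characters directly from the 36 character space, and the second re-encodes a hex value you already have.

```python
import secrets
import string

# Lowercase letters plus digits: the 36 character space discussed above.
ALPHABET = string.ascii_lowercase + string.digits

def random_code(length: int = 6) -> str:
    """Return a random code with each character drawn uniformly from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def hex_to_base36(hex_string: str) -> str:
    """Re-encode a hex string as base 36, using the digit order 0-9 then a-z."""
    digits = string.digits + string.ascii_lowercase
    value = int(hex_string, 16)
    if value == 0:
        return "0"
    out = []
    while value:
        value, remainder = divmod(value, 36)
        out.append(digits[remainder])
    return "".join(reversed(out))

print(random_code(6))             # a different 6 character code on every run
print(hex_to_base36("a94a8fe5"))  # the same value written with fewer characters
```

Note that re-encoding drops leading zeros, so if you need fixed-length codes, generating them directly with something like `random_code` is probably the simpler route.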
You may now be asking, "Well why not use upper and lower case letters and numbers, won't that be even better?" and you'd be right. However, I often don't actually use that range, for three reasons. First, by having both upper and lower case letters you increase the chances of confusion if a person ever has to read the codes. Second, it's not as big an improvement as you may think: if you're generating 1000 codes of 8 characters each, the chance of a collision goes down by about 3 orders of magnitude if you go from hex to the full alphanumeric space, and then by about 2 more orders of magnitude if you then add both upper and lower case (1 in 8,590; 1 in 5,642,220; and 1 in 436,680,211 respectively). And last, the full upper and lower case (and numbers) space is 62 characters, and at that point you might as well just pick 2 more characters you are ok with (like _ and . ) and use base 64, which is easy to generate and work with.
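If you want to sanity check those odds yourself, here is a rough sketch. It assumes the common birthday approximation p ≈ 1 - exp(-n² / (2N)), where n is the number of codes and N is the number of possible codes. The function name `collision_probability` is hypothetical, and the tool linked below may compute things differently.

```python
import math

def collision_probability(num_codes: int, alphabet_size: int, length: int) -> float:
    """Approximate chance that at least two of num_codes random codes match."""
    space = alphabet_size ** length  # N: how many distinct codes exist
    # Birthday approximation: p ~= 1 - exp(-n^2 / (2N)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-(num_codes ** 2) / (2 * space))

alphabets = [
    ("hex", 16),
    ("lowercase alphanumeric", 36),
    ("mixed case alphanumeric", 62),
]
for name, size in alphabets:
    p = collision_probability(1000, size, 8)
    print(f"{name}: roughly 1 in {1 / p:,.0f}")
```

The results should land very close to the figures quoted above. The last digit can differ depending on rounding and on exactly which approximation is used.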
Now, if you're asking "How did he calculate those 1 in x probabilities so quickly and easily?" have I got an answer for you. Since I create these sorts of codes somewhat often, both at work and in my personal life, I created a tool that I can cite when discussing this with others and trying to convince them not to use hex for this.
https://wetzel.dev/tools/collisions.html
Like most things I make, it has a bit of a learning curve, but I think once you get the hang of it, it is very easy to use to make these types of comparisons. The idea is you would use this to help answer a question like "I'm going to be generating random codes, and I think I might generate X number of them total, and I'd like the risk of a collision to be below 1 in Y odds, how many characters do I need to use?"
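That question can also be answered with a few lines of code. This is only a sketch under the same birthday approximation, p ≈ 1 - exp(-n² / (2N)), and `min_code_length` is a made-up name rather than anything the tool exposes.

```python
import math

def min_code_length(num_codes: int, alphabet_size: int, max_odds: int) -> int:
    """Smallest code length whose collision chance is below 1 in max_odds."""
    target = 1 / max_odds
    length = 1
    while True:
        space = alphabet_size ** length
        # Approximate chance of at least one collision among num_codes codes.
        p = -math.expm1(-(num_codes ** 2) / (2 * space))
        if p < target:
            return length
        length += 1

# X = 1000 codes, with the risk of a collision kept below 1 in Y = 1000.
print(min_code_length(1000, 16, 1000))  # hex
print(min_code_length(1000, 36, 1000))  # lowercase alphanumeric
```

For 1000 codes and 1 in 1000 odds, this works out to 8 characters for hex and 6 for lowercase alphanumeric, which matches the pair of example codes at the top of the post.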
Maybe this post wasn't actually brief, but if you remove the parentheticals it's about 10 words, so that counts for something. I'll end this post by just quoting the description from the page here for some reason.
This tool calculates how likely it is that, when generating random strings of characters, you will get two with the same value (a collision). When generating a large number of these random values, the chance of a collision goes up quickly, often much faster than expected. For example, if you're generating random 4 digit numbers, there are 10,000 possibilities, and so the chance that any one of those numbers matches one particular other number is 1 in 10,000. However, every time you generate a new number you must compare it to all prior numbers, so the number of comparisons can get very large, which can lead to the chance of a collision being higher than expected. The Birthday Paradox describes the surprising fact that it only takes 23 people for the chance that two of them share a birthday to be over 50%. That is not really a paradox, but the number is much lower than people generally expect.
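The 23 people figure is easy to check with an exact calculation. This is a small sketch, assuming 365 equally likely birthdays; `shared_birthday_probability` is just an illustrative name.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Exact chance that at least two of `people` share a birthday."""
    no_match = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays that are already taken.
        no_match *= (days - i) / days
    return 1.0 - no_match

print(f"{shared_birthday_probability(22):.3f}")  # just under 0.5
print(f"{shared_birthday_probability(23):.3f}")  # just over 0.5
```

The same function works for random codes if you pass the number of possible codes as `days`, for example 10,000 for 4 digit numbers.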
I made this tool mainly to show how much better using the full space of lowercase letters and numbers (36 characters) is than just using hex (16 characters) when generating random IDs. Developers often default to generating random IDs in hex because it's easy, but using the full space of the alphabet greatly reduces the chance that two randomly generated IDs will be the same, and including both upper and lowercase letters reduces it much more. As an example, if you're going to have 6 character IDs, the chance of a collision if you generate 1000 is 2.9% with hex, 0.023% with alphanumeric, and 0.00088% with upper and lower alphanumeric. You could generate over 50,000 of the case-sensitive IDs and still have a lower chance of a collision (2.18%) than if you used just hex and only generated 1000 (2.93%).
Note this tool uses native JavaScript numbers, which have a precision limit of about 15 significant decimal digits. When showing the chance of no collision with values that give a very small chance of a collision, you'll see 1, when really the answer is a number very slightly less than 1. Just know that the chance of a collision is never 0, and so the chance of no collision is never truly 1. Viewing the chance of a collision (rather than the chance of no collision) should never round to 0.
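The rounding behaviour is easy to reproduce, since Python floats are the same 64-bit doubles as JavaScript numbers. The sketch below uses a made-up tiny exponent and shows one common way to keep the collision chance from rounding to 0. It is an assumption about how you might do it, not a description of how the tool works internally.

```python
import math

# Hypothetical tiny value of n^2 / (2N), e.g. a few long codes in a huge space.
x = 1e-20

no_collision = math.exp(-x)
print(no_collision == 1.0)  # True: the result rounds to exactly 1

naive = 1.0 - math.exp(-x)  # subtracting from the rounded value loses everything
stable = -math.expm1(-x)    # expm1 computes exp(x) - 1 without that loss

print(naive)   # 0.0
print(stable)  # a tiny positive number, about 1e-20
```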







