This is a fairly common question. People want to know whether the file they received is identical to the original source. Most of the time people look at the file size, and if it’s the same, they assume the file is the same… but this is completely false: two files can have the same size and different contents. File size is a crude check at best. There are simple tools, based on complex math, that give you a far more precise answer in the form of a hash value. How precise, you ask?
Well, most people know that everything on a computer is stored as a long string of 1s and 0s. So every movie, picture or song you have is just a collection of those two digits. Let’s say you have a regular audio file that is 3 minutes long, and assume it is 3 megabytes in size. How many 1s and 0s is that? Each 1 or 0 is a bit; there are 8 bits in a byte, 1000 bytes in a kilobyte and 1000 kilobytes in a megabyte, so 3 × 1000 × 1000 × 8 gives 24,000,000 bits. If you change the value of any one of those bits, the hash value changes and the two files will no longer be deemed identical. I’d say that’s good enough for most people I know 🙂
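If you want to see the one-bit claim for yourself, here is a minimal Python sketch. It uses the standard hashlib module, and the "audio file" is just 3 MB of zero bytes that I made up as a stand-in, so treat it as an illustration rather than a real test on a real song.

```python
import hashlib

# 3 megabytes, using the 1000-based units from the text above.
size_bytes = 3 * 1000 * 1000
total_bits = size_bytes * 8

# Stand-in for the audio file: 3 MB of zero bytes (an assumption for illustration).
original = bytearray(size_bytes)

# Make a copy and flip exactly one bit in the middle of it.
altered = bytearray(original)
altered[size_bytes // 2] ^= 0b00000001

hash_original = hashlib.md5(original).hexdigest()
hash_altered = hashlib.md5(altered).hexdigest()

print(total_bits)                     # 24000000
print(hash_original)
print(hash_altered)
print(hash_original == hash_altered)  # False: one flipped bit, different hash
```

Both versions are exactly the same size, so a file size check would have called them identical.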
So why do people need or want this? A couple of scenarios…
- Many people copy home movies multiple times: from CD to DVD, from DVD to disk, over the network to internet storage. You get my drift. If a file is valuable enough, you want to be sure that it is not being changed or damaged along the way.
- If you have legal documents, you can ensure they have not been changed with a hash value. Hashes are commonly used as a way to audit whether digital records have been changed, and they are indeed used in court. A hash acts almost like a digital fingerprint for a file, so if the file has been changed, the fingerprint will be different.
- It’s a secure way to store data. Websites don’t store your password in plain text in their databases (correction, they SHOULDN’T). They store the hashed value of your password, so even someone with access to the database cannot simply read your password. Remember, hashes are one-way functions, and that means exactly what it sounds like: once the value goes in and is computed, it cannot go the other way. In other words, you cannot reconstruct the original input from the hash value.
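The password scenario can be sketched in a few lines of Python. This is only a rough illustration, not how any particular website does it. The function names make_record and check_password are made up for this example, and the random salt and the PBKDF2 function (hashlib.pbkdf2_hmac) are my additions: well-run sites typically mix in a salt and use a deliberately slow hash rather than a single plain hash.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value, not a recommendation


def make_record(password):
    """Return (salt, digest) to store in place of the password itself."""
    salt = os.urandom(16)  # random salt, stored next to the digest
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def check_password(attempt, salt, stored_digest):
    """Hash the attempt the same way and compare the two digests."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", attempt.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(digest, stored_digest)


salt, stored = make_record("correct horse")
print(check_password("correct horse", salt, stored))  # True
print(check_password("wrong guess", salt, stored))    # False
```

Notice that the site never needs to turn the digest back into the password. It just hashes whatever you type at login and checks whether the result matches.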
So what does this look like in action? I created a file that contains only the word “hello” and computed its hash value using MD5 (this is one hashing algorithm of many, but for regular home use it is more than sufficient). The hash value for the file is b1946ac92492d2347c6235b4d2611184
Then I changed the contents of the file to “bye”, and the hash value became 91fc14ad02afd60985bb8165bda320a6
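You can reproduce the “hello” value in Python with the standard hashlib module. One assumption to flag: the value above matches a file that holds the word followed by a newline character, which is what most editors and the echo command write. Without that trailing newline you get a different hash, which is a nice reminder of how sensitive these values are.

```python
import hashlib

# "hello" followed by a newline, as most editors and `echo` would save it.
with_newline = hashlib.md5(b"hello\n").hexdigest()

# The same word with no trailing newline hashes to something else entirely.
without_newline = hashlib.md5(b"hello").hexdigest()

# The second file; a trailing newline is assumed here as well.
bye_hash = hashlib.md5(b"bye\n").hexdigest()

print(with_newline)     # b1946ac92492d2347c6235b4d2611184
print(without_newline)  # 5d41402abc4b2a76b9719d911017c592
print(bye_hash)
```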
You can see that the two values are completely different and look pretty much random. So how can you use this neat tool? On Linux you have a command line utility called md5sum, and on Mac OS X one called md5. If you don’t know what a command line is, search download.com for “hash generator” or “md5” and you will get a list of easy and free apps for your operating system of choice. These apps allow quick point-and-click comparison of files and hash value generation.
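If you are comfortable with a little scripting, the same file comparison can be sketched in Python. The helper names file_md5 and same_file and the file names are made up for this example, and the three small files are created on the spot so the sketch runs by itself. Reading the file in chunks means it also works on big files like home movies without loading them entirely into memory.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the MD5 hash of a file, reading it one chunk at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def same_file(path_a, path_b):
    """Treat two files as identical when their hashes match."""
    return file_md5(path_a) == file_md5(path_b)


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "original.txt")
    good_copy = os.path.join(folder, "copy.txt")
    bad_copy = os.path.join(folder, "damaged.txt")

    # The damaged copy has the same size but one different character.
    samples = [
        (original, b"home movie"),
        (good_copy, b"home movie"),
        (bad_copy, b"home moviE"),
    ]
    for path, data in samples:
        with open(path, "wb") as f:
            f.write(data)

    matches_good = same_file(original, good_copy)
    matches_bad = same_file(original, bad_copy)

print(matches_good)  # True
print(matches_bad)   # False, even though the file sizes are equal
```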
Hashes are used all over the place in the security world. They are used to validate that e-mail has not been altered, in secure chat, to store information securely, etc.
Again, this is meant as an easy overview of something that may be useful to regular people or small businesses. Sometimes you don’t see the value in something unless it’s explained very simply and in common scenarios. Leave a comment if you have a question! BTW, your comments and e-mails are helpful; I will continue to write posts based on the comments that I get!
NOTE: I have gotten many messages saying that MD5 is worthless and can no longer be trusted. This is not true. While it has been compromised, the attack is still VERY difficult, and MD5 is more than sufficient for basic home or business use. If you believe that one chink in the armor makes it worthless, then use SHA1, BUT this too has had chinks in the armor lately, just to a lesser extent than MD5. All these hashing algorithms do essentially the same thing. Please keep your comments coming… is MD5 dead? I don’t think so…. WEP wasn’t dead when it was first cracked, because the hack was too difficult and took too long; it was two years later, when you could crack a network in 1 minute, that it was dead, baby. Most folks just want to compare one or two files and aren’t testing the results for collisions or against rainbow tables. While these are all valid points, we must note that MD5 has been in use for a long time for a reason 🙂
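For anyone who would rather not use MD5, switching algorithms is usually a one-word change. Here is a small Python sketch using hashlib.new, which picks the algorithm by name. SHA-256 is my addition here (it is not discussed above), included only to show that the pattern is the same whichever algorithm you pick.

```python
import hashlib

data = b"hello\n"

# Same input, three algorithms. Only the name passed to hashlib.new changes.
digests = {}
for name in ("md5", "sha1", "sha256"):
    digests[name] = hashlib.new(name, data).hexdigest()

for name, value in digests.items():
    # Hex lengths: md5 is 32 characters, sha1 is 40, sha256 is 64.
    print(name, len(value), value)
```

A longer hash value means more possible fingerprints, but the way you use it to compare two files stays exactly the same.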