Magic hashes – compare, secure and audit your data


A fairly common question: is the file I received identical to the original source?  Most of the time people look at the file size and, if it’s the same, assume the file must be the same… but this is completely false.  File size is a crude check at best; there are simple tools, based on complex math, that give you a precise answer called a hash value.  How precise, you ask?

Well, most people know that everything on a computer is stored as a long string of 1s and 0s.  So every movie, picture or song you have is just a collection of those two digits.  Let’s say you have a regular audio file that is 3 minutes long.  We can assume that the file is 3 megabytes in size, so how many 1s and 0s do we have?  Each 1 or 0 is a bit; there are 8 bits in a byte, 1000 bytes in a kilobyte and 1000 kilobytes in a megabyte.  So how many bits do we have in total?  3 × 1000 × 1000 × 8 = 24,000,000.  If you change the value of any one of those bits, the hash value changes and the two files will not be deemed identical.  I’d say that’s good enough for most people I know 🙂
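Here is a minimal sketch of that arithmetic and of the one-bit claim, using Python’s standard hashlib module.  The 3 MB of zero bytes is only a stand-in for the audio file (an assumption for illustration); any content of that size would behave the same way.

```python
import hashlib

# Size arithmetic from the text, using decimal units (1 MB = 1000 * 1000 bytes).
size_bytes = 3 * 1000 * 1000
size_bits = size_bytes * 8
print(size_bits)  # 24000000

# Stand-in for the audio file: 3 MB of zero bytes (illustrative assumption).
original = bytearray(size_bytes)

# Make a copy and flip a single bit somewhere in the middle.
altered = bytearray(original)
altered[1_500_000] ^= 0b00000001

digest_original = hashlib.md5(original).hexdigest()
digest_altered = hashlib.md5(altered).hexdigest()

print(digest_original)
print(digest_altered)
print(digest_original == digest_altered)  # False: one bit out of 24,000,000 differs
```

Both buffers are exactly the same size, which is why comparing file sizes alone would have missed the change.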

So why do people need or want this?  A couple of scenarios…

  • Many people copy home movies over multiple times, from CD to DVD, DVD to disk, over network to internet storage.  You get my drift.  If a file is valuable enough, you want to be sure that it is not changing or damaged along the way.
  • If you have legal documents, you can ensure they have not been changed with a hash value.  Hashes are commonly used as a way to audit whether digital records have been changed and they are indeed used in court.  A hash acts almost as a digital fingerprint for a file, so if the file has been changed, the fingerprint will be different.
  • It’s a secure way to store data.  Websites don’t store your password in plain text in their databases (correction: they SHOULDN’T).  They store the hashed value of your password, so even someone with access to the database cannot simply read your password.  Remember, hashes are one-way functions, and that means exactly what it sounds like: once the value goes in and is computed, it cannot go the other way.  In other words, you cannot reconstruct the original file from its hash value.
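The password scenario in the last bullet can be sketched roughly as follows.  This is not how any particular website does it: the function names are made up for illustration, and the sketch swaps in PBKDF2 with SHA-256 and a random salt (both from Python’s standard library) rather than a single plain hash, since a salted, deliberately slow hash is the usual recommendation for passwords.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def hash_password(password, salt=None):
    """Return (salt, digest); only these two values would be stored."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the login attempt with the stored salt and compare."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# The site keeps the salt and digest, never the password itself.
salt, stored = hash_password("correct horse")
good_login = verify_password("correct horse", salt, stored)
bad_login = verify_password("wrong guess", salt, stored)
print(good_login)  # True
print(bad_login)   # False

# Two users with the same password still get different stored values.
other_salt, other_stored = hash_password("correct horse")
print(stored == other_stored)  # False
```

The login check never reverses the hash; it simply hashes the attempt again and compares the results.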

So what does this look like in action?  I created a file that contains only the word “hello” and computed its hash value using MD5 (this is one hashing algorithm, and there are many, but for regular home use it is more than sufficient).  The hash value for the file is b1946ac92492d2347c6235b4d2611184

Now I changed the contents of the file to “bye” and the hash value is 91fc14ad02afd60985bb8165bda320a6
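If you want to try this yourself, here is a small sketch using Python’s hashlib.  One assumption to flag: the “hello” value above matches the word followed by a newline character, which is what most text editors and the echo command write, so the sketch hashes “bye” the same way.

```python
import hashlib

# Assumption: each file held the word followed by a single newline.
hello_md5 = hashlib.md5(b"hello\n").hexdigest()
bye_md5 = hashlib.md5(b"bye\n").hexdigest()

print(hello_md5)  # b1946ac92492d2347c6235b4d2611184
print(bye_md5)

# Hashing the same bytes again always gives the same value.
print(hashlib.md5(b"hello\n").hexdigest() == hello_md5)  # True
```

Note that even the invisible newline counts: hashing “hello” without it produces a different value altogether.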

You can see that they are completely different and look pretty much random.  So how can you use this neat tool?  Linux and Mac OS X come with command line utilities for this: md5sum on Linux and md5 on Mac OS X.  For those who don’t know what a command line is, search on download.com for hash generators or md5 and you will get a list of easy and free apps for your operating system of choice.  These apps allow for quick point and click comparison of files or hash value generation.
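For the curious, what those tools do under the hood can be sketched in a few lines of Python.  The function names file_md5 and same_content are made up for this example, and the three small files are created on the fly just to have something to compare.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Hash a file in chunks so large movies never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Two files are treated as identical when their hashes match."""
    return file_md5(path_a) == file_md5(path_b)


# Demo: an original, a faithful copy, and a damaged copy of the same size.
with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "original.txt")
    copy = os.path.join(folder, "copy.txt")
    damaged = os.path.join(folder, "damaged.txt")
    for path, data in ((original, b"hello\n"), (copy, b"hello\n"), (damaged, b"hellp\n")):
        with open(path, "wb") as handle:
            handle.write(data)

    copy_matches = same_content(original, copy)
    damaged_matches = same_content(original, damaged)

print(copy_matches)     # True
print(damaged_matches)  # False, even though the file size is unchanged
```

Reading in chunks is a design choice for big files; the result is the same as hashing the whole file at once.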

Hashes are used all over the place in the security world.  They are used to validate that e-mail has not been altered, in secure chat, to store information securely, etc.

Again, this is meant as an easy overview of something that may be useful to regular people or small businesses.  Sometimes you don’t see the value in something unless it’s explained very simply and in common scenarios.  Leave a comment if you have a question!  BTW, your comments and e-mails are helpful; I will continue to write posts based on the comments that I get!

NOTE: I have gotten many messages saying that MD5 is worthless and can no longer be trusted.  This is not true: while it has been compromised, the hack is still VERY difficult, and MD5 is more than sufficient for basic home or business use.  If you believe that one chink in the armor makes it worthless, then use SHA1, BUT this too has had chinks in the armor lately, just to a lesser extent than MD5.  All these hashing algorithms do essentially the same thing.  Please keep your comments coming… is MD5 dead?  I don’t think so.  WEP wasn’t dead when it was first cracked, because the hack was too difficult and took too long; it was two years later, when you could crack a network in 1 minute, that it was dead, baby.  Most folks just want to compare one or two files and aren’t testing the results for collisions or against rainbow tables.  While these are all valid points, we must note that MD5 has been in use for a long time for a reason 🙂
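If you do want to switch algorithms, it is usually a one-word change.  A rough sketch with Python’s hashlib, where the input bytes are just an illustrative stand-in for a file and SHA-256 is included as one more option beyond the two named above:

```python
import hashlib

data = b"hello\n"  # illustrative input, standing in for a file's contents

digests = {}
for name in ("md5", "sha1", "sha256"):
    # hashlib.new() selects the algorithm by name; the rest is identical.
    digests[name] = hashlib.new(name, data).hexdigest()

for name, digest in digests.items():
    # Each hex character encodes 4 bits of the hash value.
    print(name, len(digest) * 4, "bits:", digest)
```

The workflow is the same in each case; what differs is the length of the fingerprint and how well the algorithm has held up against attacks.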


4 responses to “Magic hashes – compare, secure and audit your data”

  1. Vulnerability

    Because MD5 makes only one pass over the data, if two prefixes with the same hash can be constructed, a common suffix can be added to both to make the collision more reasonable.

    Because the current collision-finding techniques allow the preceding hash state to be specified arbitrarily, a collision can be found for any desired prefix; that is, for any given string of characters X, two colliding files can be determined which both begin with X.

    All that is required to generate two colliding files is a template file, with a 128-byte block of data aligned on a 64-byte boundary, that can be changed freely by the collision-finding algorithm.

    Recently, a number of projects have created MD5 “rainbow tables” which are easily accessible online, and can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking. However, if passwords are combined with a salt before the MD5 digest is generated, rainbow tables become much less useful.

    The use of MD5 in some websites’ URLs means that Google can also sometimes function as a limited tool for reverse lookup of MD5 hashes.[10] This technique is rendered ineffective by the use of a salt.

    On December 30th, 2008, a group of researchers announced at the 25th Chaos Communication Congress how they had used MD5 collisions to create an intermediate certificate authority certificate which appeared to be legitimate when checked via its MD5 hash.[5] The researchers used a cluster of Sony PlayStation 3s at the EPFL in Lausanne, Switzerland,[11] to change a normal SSL certificate issued by RapidSSL into a working CA certificate for that issuer, which could then be used to create other certificates that would appear to be legitimate and issued by RapidSSL. VeriSign, the issuers of RapidSSL certificates, said they stopped issuing new certificates using MD5 as their checksum algorithm for RapidSSL once the vulnerability was announced.[12]
    http://en.wikipedia.org/wiki/MD5

  2. MD5 is broken, and its checksums are not trustworthy.

    http://www.mscs.dal.ca/~selinger/md5collision/

  3. Thanks for the input, I added a note to the end of the entry to address your comments.

  4. If you’re going to write articles educating people, you need to be a lot more responsible. Specifically, you seem to have written this for newbies, so it’s even more important that you don’t spread a false sense of security and complacence, and definitely not advocate outdated technology. Worse, you have no discernible reason to advocate MD5 over at least SHA1.

    MD5 is certainly fine, and always will be, for detecting transmission errors etc. (your first bullet sounds like you meant this).

    It is NOT fine for your second bullet (making sure your legal document has not been changed). That’s a document where someone might have an interest in changing it in specific ways while retaining the hash value.

    To take your WEP example, I’d rather switch when I suspect problems have started than dig my heels in and be forced to switch at the last minute, when the attacks have become ‘1-minute on a normal PC’. Why you would advocate such an attitude is quite beyond me.

    Finally, the defense that something has ‘been in use for a long time for a reason’ is also quite silly. I don’t even know where to start debunking that one — but how about with aspirin, thalidomide, asbestos, and leaded petrol for starters?
