Jeff Prosise
Checksums, cyclic redundancy checks, hash algorithms, and digital signatures are the keys to authenticating digitally transmitted data.
Imagine yourself in the following situation: A colleague just e-mailed you a memo containing confidential budget figures for the upcoming year. You have to be absolutely certain that the copy you received is genuine and that the numbers in it weren't altered during the transmission. A couple of transposed characters could cost your company millions of dollars. You're suspicious that the document may have been tampered with en route, because some of the numbers don't add up and the e-mail was routed through an external mail system. How do you know beyond a shadow of a doubt that the document you're looking at is identical to the document that was originally transmitted?
This scenario isn't as farfetched as it may seem. In a day and age when digital commerce is fast becoming a reality, one's confidence in digital transactions is wholly dependent upon the security surrounding such transactions. If you send a spreadsheet file to someone via e-mail or on a floppy disk, how does the recipient know that the spreadsheet wasn't altered by someone who had intermediate access to it? If you send a credit card number over the Internet, how does the recipient know that it really was you who placed the order?
The answers to these questions lie in a field of mathematics known as cryptography. The term cryptography implies encryption, but cryptography isn't limited to theories of data encryption. It also addresses issues related to digital authenticity--how you know that electronic data is real and how electronic documents can be "signed" with 1s and 0s that are analogous to handwritten signatures on paper. This article explores key principles of digital authenticity, beginning with simple methods for verifying the integrity of electronic data and concluding with a discussion of the U.S. government's proposed Digital Signature Standard and of digital signatures in general.
The simplest method of verifying the integrity of digitally transmitted data is the checksum method. A checksum is a value computed by adding together all the numbers in the input data. If the sum of all the numbers exceeds the highest value that a checksum can hold, the checksum wraps around--it equals the remainder that's left over when the total is divided by the checksum's maximum possible value plus 1. In mathematical terms, a checksum is computed with the equation
Checksum = Total % (MaxVal + 1)
where Total equals the sum of the input data and MaxVal is the maximum checksum value you will allow.
Suppose the document whose contents you wish to verify is the following stream of ten 1-byte values:
36 211 163 4 109 192 58 247 47 92
If the checksum is also a 1-byte value, then it can't hold a number greater than 255. The sum of the values in the document is 1,159, so the 8-bit checksum is the remainder left when 1,159 is divided by 256, or 135. If the person who sent the document calculated a checksum of, say, 246, and you get a checksum of 135, then the data was altered. A checksum is the simplest form of digital fingerprint--a value, calculated from the content of other data, that changes if the data upon which it's based changes. Checksums have been used since the dawn of computing and are still the basis for error checking in some forms of the popular XMODEM file-transfer protocol.
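To make the arithmetic concrete, here is a minimal C sketch of the checksum calculation just described (the function name checksum8 is simply an illustrative choice):

#include <stdio.h>

/* Add up every byte and keep only the remainder modulo 256
   (the maximum 8-bit checksum value plus 1). */
unsigned char checksum8(const unsigned char *data, size_t len)
{
    unsigned long total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return (unsigned char)(total % 256);
}

int main(void)
{
    unsigned char doc[] = { 36, 211, 163, 4, 109, 192, 58, 247, 47, 92 };
    /* The ten bytes total 1,159, so this prints 135. */
    printf("Checksum = %d\n", checksum8(doc, sizeof(doc)));
    return 0;
}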
The problem with checksums is that although conflicting checksums are proof positive that a document has been altered, matching checksums don't necessarily prove that the data was not altered. You can reorder the numbers in the document any way you want and the checksum won't change. Worse, you can change individual numbers in the document and tweak others so that the checksum comes out the same. When you use 8-bit checksums, there's a 1 in 256 chance that two completely random data streams will have the same checksum. Expanding the checksum length to 16 or 32 bits decreases the odds of coincidental matches, but checksums are still too susceptible to error to provide a high degree of confidence in the data they represent.
A better way to digitally fingerprint a block of data is to compute a cyclic redundancy check (CRC) value for it. CRCs have been used for years in network adapters, disk controllers, and other devices to verify that what goes in equals what comes out. Also, many modern communications programs use them to perform error checking on packets of data transmitted over phone lines.
The mathematics behind CRCs is based on something called polynomial division, in which each bit in a chunk of data represents one coefficient in a large polynomial. A polynomial is a mathematical expression like this one:
f(x) = 4x³ + x² + 7
For the purpose of CRC calculations, the polynomial that corresponds to the byte value 85 (whose 8-bit binary equivalent is 01010101) is
0·x⁷ + 1·x⁶ + 0·x⁵ + 1·x⁴ + 0·x³ + 1·x² + 0·x¹ + 1·x⁰
or simply
x⁶ + x⁴ + x² + 1
The key to CRC calculations is that polynomials can be multiplied and divided just like ordinary numbers. Dividing a "magic" polynomial (one whose coefficients are dictated by the CRC algorithm you're using) into the polynomial generated from a data stream yields a quotient polynomial and a remainder polynomial. The latter forms the basis of a CRC value. Like checksums, CRCs are small (usually 16 or 32 bits in length), but they're far more reliable at detecting minor changes in the input data. If just one bit in a large block of data changes, there's a 100 percent chance that the CRC will change, too. If two bits change, there's better than a 99.99 percent chance that a 16-bit CRC will catch the error. Swapping two bytes or adding 1 to one and subtracting 1 from another won't fool a CRC as it will a checksum.
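Here is a minimal, unoptimized C sketch of a 32-bit CRC computed with the widely published CRC-32 polynomial (in its bit-reversed form, 0xEDB88320). Production code ordinarily speeds up the inner loop with a 256-entry lookup table, but the bit-by-bit version shows the division more plainly:

#include <stdio.h>
#include <stdint.h>

/* Bit-by-bit CRC-32 (reflected form, polynomial 0xEDB88320). */
uint32_t crc32(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;               /* start with all 32 bits set */
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                       /* fold in the next byte      */
        for (int bit = 0; bit < 8; bit++)     /* "divide" one bit at a time */
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;                 /* final complement           */
}

int main(void)
{
    unsigned char doc[] = { 36, 211, 163, 4, 109, 192, 58, 247, 47, 92 };
    printf("CRC-32 = %08lX\n", (unsigned long)crc32(doc, sizeof(doc)));
    return 0;
}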
CRCs are extremely useful for checking files downloaded from online services. When someone contacts me on ZD Net about a utility that's crashing inexplicably, the first thing I usually ask them to do is compress the file with PKZIP and then type PKZIP -v to view the contents of the resulting .ZIP file. One of the things they see is the 32-bit CRC value PKZIP computed for the uncompressed file. If the CRC computed from the downloaded utility is not the same as the CRC for the file I uploaded, we know that a transmission error slipped through undetected during the download. (It happens.)
You can use CRCs to fingerprint files yourself by downloading CRC.COM from PC Magazine Online. (It's in the Tutor library of ZD Net/CompuServe's Utilities/Tips forum and in the file V15N07.ZIP on our Internet server at http://www.pcmag.com.) CRC.COM is a DOS utility that takes a filename as input and computes a 32-bit CRC value from the file's contents. It uses the popular CRC-32 algorithm used by PKZIP and IBM Token-Ring network adapters. It's also very fast (written entirely in assembly language, it uses buffered disk I/O and a "smart" implementation of the CRC-32 algorithm to minimize the number of computations) and will handle files of any size. If you transmit a file to a friend via modem, CRC.COM is a great tool for making sure the data made it through intact.
The first thing you should do after downloading CRC.COM is run it on itself by typing
CRC CRC.COM
at the DOS prompt. If the resultant CRC value is not 86C237FA, then you should download the file again.
The problem with even a 32-bit CRC value is that although it's pretty good at detecting inadvertent changes to the input data (such as those introduced by transmission errors), it doesn't stand up very well to intentional attacks. If you use a 32-bit CRC to fingerprint a file, it's relatively easy for someone with access to a computer to generate a completely different file that produces the same CRC value.
A step beyond CRCs are one-way hash algorithms that produce "hash" values. "One way" means it's easy to input A and get B, but it's impossible--or nearly impossible--to work backward from B to A. A good hash algorithm has one very important property: The values it generates are so difficult to duplicate that not even someone with a bank of Cray supercomputers and a few centuries to kill could find two sets of input data that produce the same hash value. Typically, hash values are at least 128 bits in length. The greater the length, the more difficult it is to reproduce the input or to find another set of input data that produces a matching result.
Two of the most widely known one-way hash algorithms are the MD5 message digest algorithm developed by MIT professor Ron Rivest (one of the developers of the highly regarded RSA public-key cryptosystem) and the Secure Hash Algorithm (SHA) developed by the National Institute of Standards and Technology (NIST) and the National Security Agency (NSA). MD5 produces a 128-bit digital fingerprint from a set of input data, and SHA produces a 160-bit value. Assuming no one discovers a heretofore unknown "trap door" to either algorithm, it is computationally infeasible to take a hash value produced by SHA or a "message digest" produced by MD5 and work backward to find the input data. Thus, if someone sends you a file and an MD5 or SHA fingerprint generated from the file, and if you run the same hash algorithm on the file you receive and get the same result, you can be virtually certain the file was received intact.
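As a hedged illustration--assuming the OpenSSL library, which isn't mentioned in the article, is available on your system--the sketch below reads a file and prints its 160-bit SHA-1 fingerprint as 40 hexadecimal digits, which you could then compare against the fingerprint the sender supplies:

#include <stdio.h>
#include <openssl/evp.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s filename\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL) {
        perror(argv[1]);
        return 1;
    }

    /* Feed the file through the hash algorithm a buffer at a time. */
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha1(), NULL);

    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        EVP_DigestUpdate(ctx, buf, n);
    fclose(fp);

    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;
    EVP_DigestFinal_ex(ctx, md, &mdlen);
    EVP_MD_CTX_free(ctx);

    /* Print the 160-bit fingerprint as 40 hexadecimal digits. */
    for (unsigned int i = 0; i < mdlen; i++)
        printf("%02x", md[i]);
    printf("\n");
    return 0;
}

Compile it with something like cc fingerprint.c -lcrypto (the file name and the exact command depend on how OpenSSL is installed).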
Hash algorithms can be combined with public-key cryptosystems to produce digital signatures that guarantee the authenticity of a set of input data the same way a written signature verifies the authenticity of a printed document. A public-key cryptosystem is a method of encrypting and decrypting information that relies on two input keys: a public key that is freely disseminated to everyone and a private key that is known only to its holder.
Think of a key as a password. If Tom wants Sam to be able to send him an encrypted document, and neither Tom nor Sam wants to risk transmitting a password or key because it might be intercepted, Tom can give Sam his public key (see Figure 1). Sam encrypts the document with Tom's public key and sends it to Tom. Tom then decrypts the document with his private key, which is the only key that will decrypt a document encrypted with Tom's public key. It doesn't matter if someone has Tom's public key, because it's of no value in decrypting the document. And the private key, which is known only to Tom, doesn't have to be transmitted; theoretically, it never has to leave Tom's head. Conversely, some public-key cryptosystems allow Sam to encrypt the document with his own private key and then give his public key to Tom so Tom can decrypt the document. Public-key cryptosystems such as RSA (the acronym was derived from the first letters of the surnames of the algorithm's three authors) already exist and are in wide use today.
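To make the exchange concrete, here is a rough sketch of both halves of it, again assuming OpenSSL. The key files tom_public.pem and tom_private.pem are hypothetical names for an RSA key pair Tom is assumed to have generated beforehand (for example, with the openssl command-line tool):

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/pem.h>

int main(void)
{
    /* Sam loads the public key Tom gave him. */
    FILE *fp = fopen("tom_public.pem", "r");
    EVP_PKEY *pub = fp ? PEM_read_PUBKEY(fp, NULL, NULL, NULL) : NULL;
    if (fp) fclose(fp);
    if (!pub) { fprintf(stderr, "can't load public key\n"); return 1; }

    const unsigned char msg[] = "Confidential budget figures";
    unsigned char enc[1024];
    size_t enclen = sizeof(enc);

    /* Encrypt with the public key; only the private key can reverse this. */
    EVP_PKEY_CTX *ectx = EVP_PKEY_CTX_new(pub, NULL);
    EVP_PKEY_encrypt_init(ectx);
    if (EVP_PKEY_encrypt(ectx, enc, &enclen, msg, sizeof(msg)) <= 0) {
        fprintf(stderr, "encryption failed\n");
        return 1;
    }

    /* Tom loads his private key, which never had to be transmitted. */
    fp = fopen("tom_private.pem", "r");
    EVP_PKEY *priv = fp ? PEM_read_PrivateKey(fp, NULL, NULL, NULL) : NULL;
    if (fp) fclose(fp);
    if (!priv) { fprintf(stderr, "can't load private key\n"); return 1; }

    unsigned char dec[1024];
    size_t declen = sizeof(dec);
    EVP_PKEY_CTX *dctx = EVP_PKEY_CTX_new(priv, NULL);
    EVP_PKEY_decrypt_init(dctx);
    if (EVP_PKEY_decrypt(dctx, dec, &declen, enc, enclen) <= 0) {
        fprintf(stderr, "decryption failed\n");
        return 1;
    }

    printf("Tom reads: %s\n", dec);

    EVP_PKEY_CTX_free(ectx);
    EVP_PKEY_CTX_free(dctx);
    EVP_PKEY_free(pub);
    EVP_PKEY_free(priv);
    return 0;
}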
How do digital signatures work? Here's one example. Suppose Sam wants to send Tom a contract or credit card number electronically and Tom needs an electronic signature to verify authenticity. First Sam sends the document. Then he uses a hash algorithm to generate a fingerprint for the document, encrypts the hash value with his private key, and sends the encrypted hash value to Tom. This is Sam's digital signature. Tom uses the same hash algorithm to fingerprint the document he received and then decrypts the hash value he received from Sam using Sam's public key. If the two hash values match, then Tom not only knows that the document he received is authentic, but he also knows that Sam's signature is real. Conducting commercial transactions this way is arguably more secure than using paper signatures, because paper signatures can be forged. And if the information that Sam sent to Tom is sensitive (for example, if it contains a credit card number), then it, too, can be encrypted so that only Tom can read it.
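Here is a hedged sketch of Tom's half of that exchange, once more assuming OpenSSL. The file names sam_public.pem, contract.txt, and contract.sig are hypothetical, and OpenSSL's EVP_DigestVerify calls bundle the two steps--hashing the received document and checking the hash against the signature made with Sam's private key--into a single operation:

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/pem.h>

int main(void)
{
    /* Load Sam's freely distributed public key. */
    FILE *kf = fopen("sam_public.pem", "r");
    EVP_PKEY *pub = kf ? PEM_read_PUBKEY(kf, NULL, NULL, NULL) : NULL;
    if (kf) fclose(kf);
    if (!pub) { fprintf(stderr, "can't load public key\n"); return 1; }

    /* Read the signature Sam sent along with the document. */
    unsigned char sig[1024];
    FILE *sf = fopen("contract.sig", "rb");
    if (!sf) { perror("contract.sig"); return 1; }
    size_t siglen = fread(sig, 1, sizeof(sig), sf);
    fclose(sf);

    /* Hash the document we actually received... */
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestVerifyInit(ctx, NULL, EVP_sha1(), NULL, pub);

    FILE *df = fopen("contract.txt", "rb");
    if (!df) { perror("contract.txt"); return 1; }
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), df)) > 0)
        EVP_DigestVerifyUpdate(ctx, buf, n);
    fclose(df);

    /* ...and compare it with the hash recovered from Sam's signature. */
    if (EVP_DigestVerifyFinal(ctx, sig, siglen) == 1)
        printf("Signature checks out: the document is authentic.\n");
    else
        printf("Signature does NOT match: don't trust this document.\n");

    EVP_MD_CTX_free(ctx);
    EVP_PKEY_free(pub);
    return 0;
}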
This model--or one similar to it--is probably the one we'll use to conduct business on the Internet and elsewhere. It is the basis for the U.S. government's proposed Digital Signature Standard (DSS), which relies on the Secure Hash Algorithm to produce hash values and a public-key cryptosystem known as the Digital Signature Algorithm (DSA) to produce signatures from hash values. The DSS has been criticized for various reasons, but much of the criticism comes from parties who have a financial interest in seeing that it's not adopted.
Time will tell which, if any, method for creating digital signatures will become the standard. Regardless of the outcome, of greater importance is that it is possible to conduct electronic commerce in a secure fashion.
If you'd like to learn more about hash functions, public-key cryptosystems, and digital authenticity in general, I highly recommend Bruce Schneier's excellent book Applied Cryptography: Protocols, Algorithms, and Source Code in C (1995, John Wiley & Sons). The recently published second edition is the most authoritative work of its kind anywhere, and it's loaded with useful information that includes detailed descriptions of the algorithms, the mathematics behind them, and the efforts to crack them.
For more information on CRCs and how to write code that generates them, check out W. David Schwaderer's C Programmer's Guide to NetBIOS (1988, Howard W. Sams & Company) and Joe Campbell's C Programmer's Guide to Serial Communications (1987, Howard W. Sams & Company). I've used these two books for years and find them to be invaluable resources for understanding the principles of digital data transmission and authentication.
Jeff Prosise is a contributing editor of PC Magazine. His new book, Programming Windows 95 with MFC (Microsoft Press), is due out this spring.
Copyright (c) 1996
Ziff-Davis Publishing Company