Learning About Computers and the Internet

Cast of Characters: ASCII, ANSI, UTF-8 and all that
How to keep straight the various ways that computers encode characters.

Since computers do everything in binary code, the letters and other characters that we use are actually encoded as numbers on a computer. Because there are various ways to do this encoding, things can sometimes become confusing when creating or reading documents that are used on many computers and in different parts of the world. For example, Windows may encode certain characters differently from Linux or Mac systems. Also, non-English languages may have characters not present in your computer's particular repertoire. Then there are special symbols like those in mathematics or currencies. To help bring order to all this, there are international standards under the general oversight of the Unicode Consortium, but for historical, national, and proprietary reasons there are many inconsistencies and complications. For these reasons every computer user should know a little bit about how computers represent characters. This article will try to clarify the different ways that characters are handled in computers and discuss some of the problems that can occur.

History and Background of Character Encoding

Encoding characters goes back a long way. Various methods such as smoke signals, mirror flashes, and flags have been used since ancient times. A more modern but pre-computer example is the Morse code system of long and short electrical pulses used to send telegrams. As technology advanced in the 19th and early 20th centuries, other systems such as the Hollerith punched card came into use. When the digital computer first came on the scene in the middle of the 20th century, it was, as its name suggests, a number-cruncher with little capacity for handling anything other than numerical data. However, the need to handle more than numbers soon gave rise to a system for encoding characters that became known as ASCII.

ASCII

It wasn't just computers that needed a system for encoding characters. Communications equipment such as teletype machines also required a method of representing letters and other characters. Various schemes were used, but a system called ASCII (after American Standard Code for Information Interchange) came into general use and was used as the basis for the early computer encoding. Bits were precious back then and ASCII is a 7-bit system. This has led to some confusion because today almost everything is done in 8-bit bytes (octets). ASCII is still used, but each character is now generally stored in an 8-bit byte with the most significant bit set to zero.

With 7 bits, 128 numbers (0-127 in decimal notation) are available to code characters. Fewer are actually used for printable characters because 0-31 and 127 were set aside for unprintable device controls like "line feed", "bell", and "carriage return". (Recall that ASCII was devised for a variety of devices, including the teletype.) The printable repertoire (including the character for "space") is assigned to code points 32-126. The 95 printable characters are shown in Figure I. The characters correspond pretty much to what was available on typewriter keyboards of the day. Although these are the official encodings that have been incorporated in the international system defined by Unicode, there have been some national variations and these particular encodings are sometimes called US-ASCII. Generally, however, the national variations are no longer used.

Figure I. Printable ASCII characters
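As a rough illustration of the code point ranges described above, here is a minimal Python sketch that rebuilds the printable repertoire from code points 32-126. The variable names are my own and are only for illustration.

```python
# Build the printable ASCII repertoire from its code points (32 to 126).
printable = "".join(chr(code) for code in range(32, 127))

print(len(printable))   # how many printable characters, counting the space
print(printable)

# Every ASCII character fits in 7 bits, so when it is stored in an
# 8-bit byte the most significant bit is always zero (value below 128).
ascii_bytes = printable.encode("ascii")
high_bit_clear = all(byte < 128 for byte in ascii_bytes)
print(high_bit_clear)
```

The first character of the result is the space (code 32) and the last is the tilde (code 126).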

Note a somewhat tricky point that you have to keep in mind. A number that is used by the computer to code a character is just that - code for displaying or printing something. For example, the digit "7" shown above has ASCII code number "55" (in decimal). This coding refers to something that is used to display a character and is not the same as an actual piece of numerical data that is used in arithmetical calculations. Another point is that the actual physical appearance of a character depends on what font you are using and what medium is used for the display.
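The distinction between a character's code and a numerical value can be seen in a short Python sketch. This is only an illustration using Python's built-in ord, chr, and int functions; the variable names are hypothetical.

```python
code_of_seven = ord("7")     # the code that stands for the character "7"
value_of_seven = int("7")    # the number seven, usable in arithmetic

print(code_of_seven)         # 55
print(value_of_seven + 1)    # 8
print(chr(55))               # 7
```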

The above encoding (plus the control codes) defines the strict meaning of ASCII. Unfortunately, the term "ASCII" is used loosely and does not always have the same meaning. Sometimes it means plain text that has no formatting, even when non-ASCII characters are present. Sometimes, as with FTP software, it means anything that is not a binary file.

Some other variations on ASCII use the control code points for printable characters. Since computers do not need all the controls that devices like teletypes use, some of the reserved points 0-31 were assigned instead to printable characters. These assignments varied, and each set was defined by a so-called "code page". For example, a code page is still present in Windows for use in the command line. For the complications that causes, see this reference.
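To give a feel for the kind of complication a code page can cause, here is a small Python sketch that reads the same byte under two different code pages. I am assuming code page 437 (the traditional command-line code page on US-English systems; other systems may use a different one) and comparing it with Windows-1252. The example uses a byte above 127 rather than one of the control points, because that is where Python's codecs show the difference.

```python
raw = bytes([0x82])                  # one byte with decimal value 130

as_oem = raw.decode("cp437")         # assumed command-line code page
as_windows = raw.decode("cp1252")    # Windows-1252

print(repr(as_oem), repr(as_windows))
```

The same stored byte comes out as an accented "e" under one code page and as a low quotation mark under the other, which is why text typed at the command line can look wrong in a Windows application and vice versa.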

ISO Latin 1 (ISO 8859-1)

ASCII has a very limited repertoire and was soon expanded to an 8-bit system that has 256 code points, 0-255. Added to the ASCII characters are various letters needed for writing languages of Western Europe and certain special characters. This encoding is called ISO Latin-1 or ISO 8859-1, "ISO" coming from the International Organization for Standardization. Microsoft applications also refer to this encoding as "Western European (ISO)". The additional characters occupy code positions 160 - 255 and are shown in Figure II. Code positions 128 - 159 are explicitly reserved for control purposes.

Figure II. Additional printable characters in Latin-1
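A minimal Python sketch of the 8-bit idea follows. It encodes a sample word of my own choosing with the "latin-1" codec and then tries a character that is outside the Latin-1 repertoire.

```python
text = "café"                        # sample word chosen for illustration
encoded = text.encode("latin-1")     # one byte per character
print(list(encoded))                 # [99, 97, 102, 233]

# The euro sign is not part of ISO Latin-1, so encoding it fails.
try:
    "€".encode("latin-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print(euro_fits)                     # False
```

The accented letter lands at code position 233, inside the 160-255 range of additional characters.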

This particular coding is very common and is one of those most used on the Internet. To confuse matters, however, there is a related encoding that is not official but is used by many Windows systems. Because Windows is so widespread, this encoding is also common. It is discussed next.

ANSI (Windows-1252)

This 8-bit Microsoft-specific encoding is not part of the official Unicode standards but is common because of Microsoft's dominance. It is the same as ISO Latin-1 with one big exception. Positions 128-159 in ISO Latin-1 are reserved for controls, but the Microsoft encoding uses most of them for printable characters. This Microsoft variation is called variously ANSI, Windows-1252, or Windows Latin-1. Microsoft applications also sometimes use the name "Western European (Windows)". Just to confuse matters, some people even call this encoding "ASCII". Detailed tables of the encoding are at numerous references, including Alan Wood's site.

Note that an application that expects a file to be encoded according to ISO Latin-1 will not render correctly the characters corresponding to code points 128-159. For example, files saved in Notepad often use ANSI and that can sometimes lead to problems.
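The mismatch can be sketched in Python by decoding the same three bytes both ways. The byte values are my own choice of examples from the 128-159 range, and "cp1252" is the name Python uses for Windows-1252.

```python
data = bytes([0x80, 0x93, 0x94])     # code points 128, 147, 148

as_windows = data.decode("cp1252")   # printable characters
as_latin1 = data.decode("latin-1")   # invisible control codes

print(as_windows)
print([hex(ord(ch)) for ch in as_latin1])
```

Read as Windows-1252 the bytes are the euro sign and a pair of curly quotation marks. Read as ISO Latin-1 they are control codes with nothing to display, so the characters seem to vanish or show up as placeholder boxes.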

UTF-8 (8-bit Unicode Transformation Format)

So far we have only considered character encodings appropriate for Western European languages. However, even in Europe there are other alphabets such as Greek and Cyrillic, and the rest of the world has many more, as well as ideographic writing. The general rules for encoding the languages of the world as well as many specialized symbols are established by the Unicode Consortium. One 8-bit byte is no longer sufficient when coding the many languages, and the most common way of representing the Unicode standards uses a variable number of 8-bit blocks or octets and is called UTF-8. From one to four octets can be used, but the old ASCII encoding is preserved with the use of a single octet. The ISO Latin-1 characters keep the same code point numbers (160-255) in Unicode, but UTF-8 represents each of them with two octets. This backward compatibility with ASCII is very useful and is one reason for the wide use of UTF-8 on the Internet.
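The variable length of UTF-8 can be seen in a short Python sketch. The four sample characters are my own picks, chosen so that each needs a different number of octets.

```python
samples = ["A", "é", "€", "😀"]

# Number of octets each character occupies when encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(lengths)

# A plain ASCII character is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
print(same_as_ascii)
```

The unaccented letter takes one octet, the accented letter two, the euro sign three, and the emoji four.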

Another system using 16-bit units and called UTF-16 is also in use. For example, Windows XP uses it internally. However, it is not usually encountered in ordinary use of a computer.
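For comparison, here is a small Python sketch that counts the 16-bit units UTF-16 needs for a few sample characters of my own choosing. I use the "utf-16-le" codec (little-endian, no byte order mark) simply to keep the arithmetic clean.

```python
samples = ["A", "€", "😀"]

# Each 16-bit unit is two bytes, so divide the byte count by two.
units = {ch: len(ch.encode("utf-16-le")) // 2 for ch in samples}
print(units)
```

Even a plain ASCII letter takes a full 16-bit unit, and characters outside the most commonly used range take two units.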

Example of encoding problem - Unexpected characters in Notepad

Periodically someone posts something on the Internet about how mysterious messages appear in the Windows accessory Notepad. What appears to be a mystery is the result of Notepad mistaking which character encoding is being used. Raymond Chen explains what is happening in detail at this link and gives more explanation at this link.
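The general effect of guessing the wrong encoding can be imitated in Python. This sketch does not reproduce Notepad's actual detection logic; it only shows what happens when bytes written as plain ASCII are read as if they were UTF-16. The sample text is hypothetical.

```python
original = "plain text"              # hypothetical file contents
raw = original.encode("ascii")       # ten bytes, one per character

# Read the same bytes as UTF-16: every pair of bytes becomes one character.
misread = raw.decode("utf-16-le")

print(len(original), len(misread))   # 10 5
print(misread)
```

Nothing in the file has changed. The ten bytes are simply grouped in pairs and looked up in a different table, which produces a handful of unrelated characters in place of the English words.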

Using character codes

Provided that your system has fonts that support them, various characters not found on a regular keyboard can be entered into documents by using character codes in certain ways. I discuss some of these on another page as well as using the Windows Character Map.


©2002-2016 Victor Laurie