Chapter 2

An Introduction to Unicode

In the first chapter, I promised to elaborate on any aspects of C that you might not have encountered in conventional character-mode programming but that play a part in Microsoft Windows. The subject of wide-character sets and Unicode almost certainly qualifies in that respect.

Very simply, Unicode is an extension of ASCII character encoding. Rather than the 7 bits used to represent each character in strict ASCII, or the 8 bits per character that have become common on computers, Unicode uses a full 16 bits for character encoding. This allows Unicode to represent all the letters, ideographs, and other symbols used in all the written languages of the world that are likely to be used in computer communication. Unicode is intended initially to supplement ASCII and, with any luck, eventually replace it. Considering that ASCII is one of the most dominant standards in computing, this is certainly a tall order.
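To put numbers on those widths, here is a minimal sketch that counts how many distinct codes a fixed-width encoding can hold. The names code_count and print_code_counts are hypothetical helpers made up for this illustration, not part of any library.

```c
#include <stdio.h>

/* Hypothetical helper: the number of distinct codes available in a
   fixed-width encoding of `bits` bits. Valid for bits below 32. */
unsigned long code_count(unsigned bits)
{
    return 1UL << bits;
}

/* Print the code counts for the three widths discussed in the text. */
void print_code_counts(void)
{
    printf("7-bit ASCII:      %lu codes\n", code_count(7));
    printf("8-bit characters: %lu codes\n", code_count(8));
    printf("16-bit Unicode:   %lu codes\n", code_count(16));
}
```

Calling print_code_counts from main shows the jump from 128 and 256 codes to 65,536, which is what makes room for ideographs alongside alphabetic scripts.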

Unicode affects every part of the computer industry, but perhaps most profoundly operating systems and programming languages. In this respect, we are almost halfway there. Windows NT supports Unicode from the ground up. (Unfortunately, Windows 98 includes only a small amount of Unicode support.) The C programming language as formalized by ANSI accommodates Unicode through its support of wide characters, which I'll discuss in detail below.
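As a first taste of what wide characters look like in C, here is a small sketch that compares the storage needed for an ordinary string and a wide string. The functions narrow_storage_bytes and wide_storage_bytes are hypothetical helpers written for this illustration. The wchar_t type, the L prefix on string literals, and the wcslen function are standard C. Keep in mind that the width of wchar_t is implementation-defined: it is 16 bits with Microsoft's compilers but often 32 bits elsewhere, so the sketch uses sizeof rather than assuming a size.

```c
#include <stddef.h>   /* size_t, wchar_t */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wcslen */

/* Hypothetical helper: bytes needed to store an ordinary string,
   including its terminating zero. */
size_t narrow_storage_bytes(const char *s)
{
    return (strlen(s) + 1) * sizeof(char);
}

/* Hypothetical helper: bytes needed to store a wide string,
   including its terminating zero. The result depends on how wide
   wchar_t is on the compiler in use. */
size_t wide_storage_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

Passing "Hello!" to the first function and L"Hello!" to the second gives 7 bytes for the ordinary string and 7 times sizeof (wchar_t) bytes for the wide one. Both strings have six characters; only the storage per character differs.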

Of course, as usual, we as programmers are confronted with much of the dirty work. I've tried to ease the load by making all of the programs in this book "Unicode-ready." What this means exactly will become more apparent as I discuss Unicode in this chapter.