Wide Characters and C

To a C programmer, the whole idea of 16-bit characters can certainly provoke uneasy chills. That a char is the same width as a byte is one of the very few certainties of this life. Few programmers are aware that ANSI/ISO 9899-1990, the "American National Standard for Programming Languages—C" (also known as "ANSI C") supports character sets that require more than one byte per character through a concept called "wide characters." These wide characters coexist nicely with normal and familiar characters.

ANSI C also supports multibyte character sets, such as those supported by the Chinese, Japanese, and Korean versions of Windows. However, these multibyte character sets are treated as strings of single-byte values in which some characters alter the meaning of successive characters. Multibyte character sets mostly impact the C run-time library functions. In contrast, wide characters are uniformly wider than normal characters and involve some compiler issues.

Wide characters aren't necessarily Unicode. Unicode is one possible wide-character encoding. However, because the focus in this book is Windows rather than an abstract implementation of C, I will tend to speak of wide characters and Unicode synonymously.

The char Data Type

Presumably, we are all quite familiar with defining and storing characters and character strings in our C programs by using the char data type. But to facilitate an understanding of how C handles wide characters, let's first review normal character definition as it might appear in a Win32 program.

The following statement defines and initializes a variable containing a single character:

char c = 'A' ;

The variable c requires 1 byte of storage and will be initialized with the hexadecimal value 0x41, which is the ASCII code for the letter A.

You can define a pointer to a character string like so:

char * p ;

In a 32-bit Windows program, the pointer variable p requires 4 bytes of storage (a 64-bit build would use 8). You can also initialize a pointer to a character string:

char * p = "Hello!" ;

The variable p still requires 4 bytes of storage as before. The character string is stored in static memory and uses 7 bytes of storage—the 6 bytes of the string in addition to a terminating 0.

You can also define an array of characters, like this:

char a[10] ;

In this case, the compiler reserves 10 bytes of storage for the array. The expression sizeof (a) will return 10. If the array is global (that is, defined outside any function), you can initialize an array of characters by using a statement like so:

char a[] = "Hello!" ;

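If you define this array as a local variable in a function, older (pre-ANSI) compilers required it to be defined as a static variable, as follows:

static char a[] = "Hello!" ;

In either case, the string is stored in static program memory with a 0 appended at the end, thus requiring 7 bytes of storage. (ANSI C also lets you initialize a nonstatic local array this way; the array then lives on the stack and is filled in from the 7-byte string each time the function is entered.)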

Wider Characters

Nothing about Unicode or wide characters alters the meaning of the char data type in C. The char continues to indicate 1 byte of storage, and sizeof (char) continues to return 1. In theory, a byte in C can be greater than 8 bits, but for most of us, a byte (and hence a char) is 8 bits wide.

Wide characters in C are based on the wchar_t data type, which is defined in several header files, including WCHAR.H, like so:

typedef unsigned short wchar_t ;

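Thus, in Visual C++ the wchar_t data type is the same as an unsigned short integer: 16 bits wide. (Other compilers are free to choose a different width; 32 bits is common outside Windows.)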

To define a variable containing a single wide character, use the following statement:

wchar_t c = 'A' ;

The variable c is the two-byte value 0x0041, which is the Unicode representation of the letter A. (However, because Intel microprocessors store multibyte values with the least-significant bytes first, the bytes are actually stored in memory in the sequence 0x41, 0x00. Keep this in mind if you examine memory storage of Unicode text.)

You can also define an initialized pointer to a wide-character string:

wchar_t * p = L"Hello!" ;

Notice the capital L (for long) immediately preceding the first quotation mark. This indicates to the compiler that the string is to be stored with wide characters—that is, with every character occupying 2 bytes. The pointer variable p requires 4 bytes of storage, as usual, but the character string requires 14 bytes—2 bytes for each character with 2 bytes of zeros at the end.

Similarly, you can define an array of wide characters this way:

static wchar_t a[] = L"Hello!" ;

The string again requires 14 bytes of storage, and sizeof (a) will return 14. You can index the a array to get at the individual characters. The value a[1] is the wide character 'e', or 0x0065.

Although it looks more like a typo than anything else, that L preceding the first quotation mark is very important, and there must not be space between the two symbols. Only with that L will the compiler know you want the string to be stored with 2 bytes per character. Later on, when we look at wide-character strings in places other than variable definitions, you'll encounter the L preceding the first quotation mark again. Fortunately, the C compiler will often give you a warning or error message if you forget to include the L.

You can also use the L prefix in front of single character literals, as shown here, to indicate that they should be interpreted as wide characters.

wchar_t c = L'A' ;

But it's usually not necessary. The C compiler will zero-extend the character anyway.

Wide-Character Library Functions

We all know how to find the length of a string. For example, if we have defined a pointer to a character string like so:

char * pc = "Hello!" ;

we can call

iLength = strlen (pc) ;

The variable iLength will be set equal to 6, the number of characters in the string.

Excellent! Now let's try defining a pointer to a string of wide characters:

wchar_t * pw = L"Hello!" ;

And now we call strlen again:

iLength = strlen (pw) ;

Now the troubles begin. First, the C compiler gives you a warning message, probably something along the lines of

'function' : incompatible types - from 'unsigned short *' to 'const char *'

It's telling you that the strlen function is declared as accepting a pointer to a char, and it's getting a pointer to an unsigned short. You can still compile and run the program, but you'll find that iLength is set to 1. What happened?

The 6 characters of the character string "Hello!" have the 16-bit values:

0x0048 0x0065 0x006C 0x006C 0x006F 0x0021

which are stored in memory by Intel processors like so:

48 00 65 00 6C 00 6C 00 6F 00 21 00

The strlen function, assuming that it's attempting to find the length of a string of characters, counts the first byte as a character but then assumes that the second byte is a zero byte denoting the end of the string.

This little exercise clearly illustrates the differences between the C language itself and the run-time library functions. The compiler interprets the string L"Hello!" as a collection of 16-bit short integers and stores them in the wchar_t array. The compiler also handles any array indexing and the sizeof operator, so these work properly. But run-time library functions such as strlen are added during link time. These functions expect strings that comprise single-byte characters. When they are confronted with wide-character strings, they don't perform as we'd like.

Oh, great, you say. Now every C library function has to be rewritten to accept wide characters. Well, not every C library function. Only the ones that have string arguments. And you don't have to rewrite them. It's already been done.

The wide-character version of the strlen function is called wcslen ("wide-character string length"), and it's declared both in STRING.H (where the declaration for strlen resides) and WCHAR.H. The strlen function is declared like this:

size_t __cdecl strlen (const char *) ;

and the wcslen function looks like this:

size_t __cdecl wcslen (const wchar_t *) ;

So now we know that when we need to find out the length of a wide-character string we can call

iLength = wcslen (pw) ;

The function returns 6, the number of characters in the string. Keep in mind that the character length of a string does not change when you move to wide characters—only the byte length changes.

All your favorite C run-time library functions that take string arguments have wide-character versions. For example, wprintf is the wide-character version of printf. These functions are declared both in WCHAR.H and in the header file where the normal function is declared.

Maintaining a Single Source

There are, of course, certain disadvantages to using Unicode. First and foremost is that every string in your program will occupy twice as much space. In addition, you'll observe that the functions in the wide-character run-time library are larger than the usual functions. For this reason, you might want to create two versions of your program—one with ASCII strings and the other with Unicode strings. The best solution would be to maintain a single source code file that you could compile for either ASCII or Unicode.

That's a bit of a problem, though, because the run-time library functions have different names, you're defining characters differently, and then there's that nuisance of preceding the string literals with an L.

One answer is to use the TCHAR.H header file included with Microsoft Visual C++. This header file is not part of the ANSI C standard, so every function and macro it defines has a name that begins with an underscore. TCHAR.H provides a set of alternative names for the normal run-time library functions requiring string parameters (for example, _tprintf and _tcslen). These are sometimes referred to as "generic" function names because they can refer to either the Unicode or non-Unicode versions of the functions.

If an identifier named _UNICODE is defined and the TCHAR.H header file is included in your program, _tcslen is defined to be wcslen:

#define _tcslen wcslen

If _UNICODE isn't defined, _tcslen is defined to be strlen:

#define _tcslen strlen

And so on. TCHAR.H also solves the problem of the two character data types with a new data type named TCHAR. If the _UNICODE identifier is defined, TCHAR is wchar_t:

typedef wchar_t TCHAR ;

Otherwise, TCHAR is simply a char:

typedef char TCHAR ;

Now it's time to address that sticky L problem with the string literals. If the _UNICODE identifier is defined, a macro called __T is defined like this:

#define __T(x) L##x

This is fairly obscure syntax, but it's in the ANSI C standard for the C preprocessor. That pair of number signs is called a "token paste," and it causes the letter L to be joined to the front of the macro parameter. Thus, if the macro parameter is "Hello!", then L##x is L"Hello!".

If the _UNICODE identifier is not defined, the __T macro is simply defined in the following way:

#define __T(x) x

Regardless, two other macros are defined to be the same as __T:

#define _T(x) __T(x)
#define _TEXT(x) __T(x)

Which one you use for your Win32 console programs depends on how concise or verbose you'd like to be. Basically, you must define your string literals inside the _T or _TEXT macro in the following way:

_TEXT ("Hello!") 

Doing so causes the string to be interpreted as composed of wide characters if the _UNICODE identifier is defined and as 8-bit characters if not.