Programming for UTF-8 Encoding

LabWindows/CVI

UTF-8 is a type of Unicode encoding that can represent all characters in the Unicode standard. UTF-8 is a variable-width type of encoding where characters are composed of one to four bytes. UTF-8 is backwards compatible with ASCII.

Here are some terms to know when working with UTF-8 encoding:

  • Single-byte character—UTF-8 character composed of only one byte. The first 128 UTF-8 characters are single-byte and correspond to ASCII.
  • Lead byte—First byte of a UTF-8 character. The lead byte encodes information about the number of expected trail bytes to follow.
  • Trail byte—Remaining byte(s) of a UTF-8 character. A UTF-8 character can have up to three trail bytes. Trail byte(s) can be identified by their value alone, unlike ANSI encoding where the preceding lead byte is used to identify the trail byte(s).

LabWindows/CVI detects and reports non-fatal runtime errors when a LabWindows/CVI library function receives a string whose bytes do not represent a valid UTF-8 encoding. See User Protection for more information about non-fatal user protection errors.
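
The terms above map directly onto bit patterns in each byte. The following is a minimal sketch in standard C, not part of the LabWindows/CVI libraries; the names IsUtf8TrailByte and Utf8CharLength are hypothetical. It classifies a byte as a trail byte and reads the expected character length from a lead byte. It checks only the leading bits and does not reject every invalid sequence (for example, overlong encodings).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not LabWindows/CVI functions. */

/* A trail byte always has the bit pattern 10xxxxxx, so it can be
   recognized from its value alone. */
static int IsUtf8TrailByte (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Returns the total number of bytes (1 to 4) in the character that
   starts at p, based only on the lead byte, or 0 if the byte at p
   cannot start a character. */
static size_t Utf8CharLength (const char *p)
{
  unsigned char lead;

  assert (p != NULL);
  lead = (unsigned char)*p;
  if (lead < 0x80)
    return 1;                 /* 0xxxxxxx: single-byte character (ASCII) */
  if ((lead & 0xE0) == 0xC0)
    return 2;                 /* 110xxxxx: one trail byte follows */
  if ((lead & 0xF0) == 0xE0)
    return 3;                 /* 1110xxxx: two trail bytes follow */
  if ((lead & 0xF8) == 0xF0)
    return 4;                 /* 11110xxx: three trail bytes follow */
  return 0;                   /* trail byte or invalid lead byte */
}
```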

String Handling

When you manipulate strings with UTF-8 characters, treat the lead byte and the trail byte(s) as a single unit. This affects every instance in your program where characters or strings are handled.

Keep in mind the difference between the length of a string measured in bytes and the length of a string measured in characters. In many instances, you should use the number of bytes. For example, when you allocate a buffer for storing a string, use the string length in bytes, because each character can occupy up to four bytes. Use the ANSI C Library function strlen, which returns the number of bytes in a string, to measure the length of the string.

In other cases, you must replace ANSI C Library functions that take string parameters with the Multibyte Character functions listed in the ANSI C Library Function Tree topic or the macros described in the Multibyte Macros and Functions in toolbox.h topic. For example, if you want to compare UTF-8 strings in a case-insensitive manner, you must replace stricmp with _mbsicmp. The lower-case and capital versions of a character, such as ä and Ä, are encoded with different bytes. Because the stricmp function simply compares bytes, it would treat ä and Ä as different characters.
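
To make the byte-versus-character distinction concrete, here is a minimal sketch in standard C. The helper names CountUtf8Characters and DuplicateUtf8String are hypothetical and are not LabWindows/CVI functions. The sketch assumes the input is valid UTF-8: it counts characters by skipping trail bytes, and it sizes the copy buffer with strlen, that is, in bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts characters by counting every byte that
   is not a trail byte (10xxxxxx). Assumes text is valid UTF-8. */
static size_t CountUtf8Characters (const char *text)
{
  size_t count = 0;

  assert (text != NULL);
  for (; *text != '\0'; text++)
    if (((unsigned char)*text & 0xC0) != 0x80)
      count++;
  return count;
}

/* Hypothetical helper: duplicates a string. The buffer size comes
   from strlen, which reports bytes, not characters. The caller
   releases the copy with free. */
static char *DuplicateUtf8String (const char *text)
{
  size_t byteLength;
  char *copy;

  assert (text != NULL);
  byteLength = strlen (text);
  copy = malloc (byteLength + 1);           /* +1 for the terminating NUL */
  if (copy != NULL)
    memcpy (copy, text, byteLength + 1);
  return copy;
}
```

For the string "äbc", encoded as the bytes C3 A4 62 63, strlen returns 4 while CountUtf8Characters returns 3. The same encoding explains the case-comparison problem: ä is C3 A4 and Ä is C3 84, so a byte-wise comparison cannot relate them.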

Note  Refer to the EVENT_KEYPRESS topic for information about processing keypress events that result from UTF-8 input.

Write all of your string processing code in a UTF-8-aware manner. For example, pointers should usually indicate the start of a character and indices should always reference the start of a character. Use the CmbStrInc or CmbStrDec macros in toolbox.h instead of the ++ and -- operators to modify the value of pointers into your strings. Process strings sequentially, from left to right, rather than randomly. Accessing random characters in a multibyte string is computationally expensive and can be error-prone.
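
As an illustration of why ++ and -- are not sufficient, the following sketch shows what character-wise pointer movement involves. It is written in standard C with the hypothetical names Utf8Next and Utf8Prev; it is not the implementation of the toolbox.h macros, and it assumes the string is valid UTF-8. In LabWindows/CVI code, use CmbStrInc and CmbStrDec instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advances p to the start of the next character
   by stepping over one byte and then over any trail bytes (10xxxxxx).
   Returns p unchanged at the end of the string. */
static const char *Utf8Next (const char *p)
{
  assert (p != NULL);
  if (*p == '\0')
    return p;
  p++;
  while (((unsigned char)*p & 0xC0) == 0x80)
    p++;
  return p;
}

/* Hypothetical helper: moves p back to the start of the previous
   character, never moving before start. */
static const char *Utf8Prev (const char *start, const char *p)
{
  assert (start != NULL && p != NULL && p >= start);
  if (p == start)
    return p;
  p--;
  while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
    p--;
  return p;
}
```

Moving backward works only because trail bytes are recognizable from their value alone, which is the property described in the terms at the top of this topic.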

The following code example counts the number of non-ASCII characters in a text string. Before UTF-8 changes, your code might look like the following example:

size_t CVIFUNC CountNonAsciiCharacters (const char *text)
{
  size_t index;
  size_t length = strlen(text);
  size_t count = 0;
  for (index = 0; index < length; index++)
    if (text[index] < 0)
      count++;
  return count;
}

After UTF-8 changes, your code might look like the following example:

size_t CVIFUNC CountNonAsciiCharacters (const char *text)
{
  const unsigned char *textPtr = (const unsigned char *)text;
  CmbChar character;
  size_t count = 0;
  for (character = CmbGetC(textPtr); character != 0; character = CmbIncGetC(textPtr))
  {
    if (!CmbIsSingleC(character))
      count++;
  }
  return count;
}

Interacting with ANSI Functions

If your code includes ANSI functions that take string parameters but do not have UTF-8 support, add code to convert strings from UTF-8 to ANSI before calling the ANSI functions and from ANSI back to UTF-8 afterward. Use the String Conversion functions listed in the Utility Library Function Tree to convert your strings. Information may be lost during the conversion process because only a subset of UTF-8 characters can be represented with ANSI encoding.
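
The following sketch illustrates how information can be lost in such a conversion. It is not one of the Utility Library String Conversion functions. It is a simplified stand-in in standard C that assumes ISO-8859-1 (Latin-1) as the target single-byte encoding, which differs from real Windows ANSI code pages. The name Utf8ToLatin1Lossy is hypothetical. Characters outside the target range are replaced with a question mark and counted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts UTF-8 to ISO-8859-1 (an assumed
   stand-in for an ANSI code page). Characters above U+00FF cannot be
   represented and are written as '?'. Output is truncated if the
   buffer is too small. Returns the number of replaced characters. */
static size_t Utf8ToLatin1Lossy (const char *utf8, char *out, size_t outSize)
{
  const unsigned char *p = (const unsigned char *)utf8;
  size_t lost = 0;
  size_t written = 0;

  assert (utf8 != NULL && out != NULL && outSize > 0);
  while (*p != 0 && written + 1 < outSize)
    {
      if (*p < 0x80)
        {
          /* Single-byte character: identical in both encodings. */
          out[written++] = (char)*p++;
        }
      else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80)
        {
          /* Two-byte character in the range U+0080 to U+00FF. */
          out[written++] = (char)(((p[0] & 0x03) << 6) | (p[1] & 0x3F));
          p += 2;
        }
      else
        {
          /* Not representable: substitute, then skip the trail bytes. */
          out[written++] = '?';
          lost++;
          p++;
          while ((*p & 0xC0) == 0x80)
            p++;
        }
    }
  out[written] = '\0';
  return lost;
}
```

Converting "ä€A" with this sketch yields the bytes E4 3F 41 and a loss count of 1: the ä survives, but the € is replaced and cannot be recovered when the result is converted back to UTF-8.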

Interacting with Non-UTF-8 Unicode Functions

If your code includes non-UTF-8 Unicode functions that take string parameters, add code to convert strings from UTF-8 to the supported Unicode encoding before calling the Unicode functions and back to UTF-8 afterward. For example, Windows SDK functions support UTF-16, or wide characters. The wide character functions are indicated with a W suffix in the Windows SDK. Use the MultiByteToWideChar and WideCharToMultiByte Windows SDK functions or the Multibyte Character functions listed in the ANSI C Library Function Tree to convert your strings.
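
A common way to use MultiByteToWideChar is the two-call pattern sketched below: the first call asks for the required buffer size, and the second call performs the conversion. The helper name Utf8ToWide is hypothetical, and the code compiles only against the Windows SDK, so it is guarded with _WIN32. Treat it as a sketch under those assumptions rather than the approach used in the shipped example.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _WIN32   /* Windows SDK only */
#include <windows.h>

/* Hypothetical helper: returns a newly allocated UTF-16 copy of a
   NUL-terminated UTF-8 string, or NULL on failure. The caller
   releases the result with free. */
static wchar_t *Utf8ToWide (const char *utf8)
{
  wchar_t *wide;
  int wideCount;

  assert (utf8 != NULL);

  /* First call: a length of -1 means "NUL-terminated input", and the
     returned count includes the terminating NUL. */
  wideCount = MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8, -1, NULL, 0);
  if (wideCount <= 0)
    return NULL;              /* invalid UTF-8 or other failure */

  wide = malloc ((size_t)wideCount * sizeof (wchar_t));
  if (wide == NULL)
    return NULL;

  /* Second call: perform the conversion into the sized buffer. */
  if (MultiByteToWideChar (CP_UTF8, MB_ERR_INVALID_CHARS,
                           utf8, -1, wide, wideCount) == 0)
    {
      free (wide);
      return NULL;
    }
  return wide;
}
#endif
```

The reverse direction follows the same pattern with WideCharToMultiByte and CP_UTF8. The MB_ERR_INVALID_CHARS flag makes the call fail on malformed input instead of silently substituting characters.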

Refer to sdk/verinfo/verinfos.cws for an example of converting between wide characters and UTF-8 when using Windows SDK functions.