Strings are a fundamental concept in software engineering, but they are not a built-in type in C. Null-terminated byte strings (NTBS) consist of a contiguous sequence of characters terminated by and including the first null character and are supported in C as the format used for string literals. The C programming language supports single-byte character strings, multibyte character strings, and wide character strings. Single-byte and multibyte character strings are both described as null-terminated byte strings, which are also called narrow character strings.

A pointer to a null-terminated byte string points to its initial character. The length of the string is the number of bytes preceding the null character, and the value of the string is the sequence of the values of the contained characters, in order.
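As a minimal sketch of these definitions, the following helpers compute the length of a null-terminated byte string by hand and the number of bytes it occupies including the terminator. The names ntbs_length and ntbs_size are hypothetical and chosen only for illustration; in practice strlen() already provides the length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the bytes that precede the terminating
 * null character. Same result as strlen(); written out to mirror the
 * definition of string length. Assumes s is a valid, non-null NTBS. */
size_t ntbs_length(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

/* Hypothetical helper: bytes occupied by the string itself,
 * including the terminating null character. */
size_t ntbs_size(const char *s) {
    return strlen(s) + 1;
}
```

For the literal "abc", the length is 3 while the storage occupied is 4 bytes, which is also what sizeof("abc") yields.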

A wide string is a contiguous sequence of wide characters (of type wchar_t) terminated by and including the first null wide character. A pointer to a wide string points to its initial (lowest addressed) wide character. The length of a wide string is the number of wide characters preceding the null wide character, and the value of a wide string is the sequence of code values of the contained wide characters, in order.
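The same distinction can be sketched for wide strings. The helpers below are hypothetical names used only for illustration; they rely on the standard wcslen() function and show that the length is counted in wide characters, whereas the storage is that count plus one, multiplied by sizeof(wchar_t).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of wide characters preceding the
 * null wide character. Assumes ws is a valid, non-null wide string. */
size_t wide_length(const wchar_t *ws) {
    return wcslen(ws);
}

/* Hypothetical helper: storage in bytes, including the terminating
 * null wide character. */
size_t wide_size_bytes(const wchar_t *ws) {
    return (wcslen(ws) + 1) * sizeof(wchar_t);
}
```

Confusing the element count with the byte count is a common source of undersized allocations for wide strings, because sizeof(wchar_t) is typically greater than 1.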

Null-terminated byte strings are implemented as arrays of characters and are susceptible to the same problems as arrays. As a result, rules and recommendations for arrays should also be applied to null-terminated byte strings.
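One way to illustrate the array connection is a copy that checks the destination array's size before writing. This is only a sketch under stated assumptions: copy_ntbs is a hypothetical helper, not a standard function, and it assumes src is a valid null-terminated byte string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst only if the string and its
 * terminating null character fit in dst_size bytes.
 * Returns 0 on success, -1 if dst is null or too small. */
int copy_ntbs(char *dst, size_t dst_size, const char *src) {
    size_t len = strlen(src);
    if (dst == NULL || len >= dst_size) {
        return -1;                 /* no room for len bytes plus '\0' */
    }
    memcpy(dst, src, len + 1);     /* copy the terminator as well */
    return 0;
}
```

With a destination declared as char buf[4], copying "abc" succeeds, while copying "abcd" is rejected because the terminator would be written past the end of the array.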

The C standard uses the following philosophy for choosing character types, though it is not explicitly stated in one place.

signed char and unsigned char

"plain" char

int

Note that the two different ways a character is used as an int (as an unsigned char + EOF or as a plain char converted to int) can lead to confusion. For example, isspace('\200') results in undefined behavior when char is signed.
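A common way to avoid this problem is to convert the character to unsigned char before passing it to a <ctype.h> classification function. The following sketch uses a hypothetical helper, count_leading_space, to show the cast; it assumes s is a valid null-terminated byte string.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts leading whitespace characters.
 * The cast to unsigned char keeps the argument within the range that
 * isspace() accepts, even when plain char is signed and the byte has
 * its high bit set. */
size_t count_leading_space(const char *s) {
    size_t n = 0;
    while (isspace((unsigned char)s[n])) {
        ++n;
    }
    return n;
}
```

Without the cast, a string beginning with '\200' would pass a negative value other than EOF to isspace() on implementations where char is signed.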

unsigned char

Unlike other integer types, unsigned char has the unique property that

values stored in [...] objects of type unsigned char shall be represented using a pure binary notation. (C11, Section 6.2.6.1, paragraph 3 [ISO/IEC 9899:2011])

where a pure binary notation is defined as the following:

A positional representation for integers that uses the binary digits 0 and 1, in which the values represented by successive bits are additive, begin with 1, and are multiplied by successive integral powers of 2, except perhaps the bit with the highest position. A byte contains CHAR_BIT bits, and the values of type unsigned char range from 0 to 2^CHAR_BIT − 1. (Section 6.2.6.1, fn. 49)

That is, objects of type unsigned char cannot have padding bits and consequently have no trap representations. As a result, non-bit-field objects of any type may be copied into an array of unsigned char (for example, via memcpy()) and have their representation examined one byte at a time.
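As a rough sketch of this technique, the helper below copies the object representation of an arbitrary object into an array of unsigned char so that it can be inspected byte by byte. The name copy_representation is hypothetical and used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the object representation of *obj into
 * out. Returns the number of bytes copied, or 0 if either pointer is
 * null or out is too small to hold the whole object. */
size_t copy_representation(unsigned char *out, size_t out_size,
                           const void *obj, size_t obj_size) {
    if (out == NULL || obj == NULL || obj_size > out_size) {
        return 0;
    }
    memcpy(out, obj, obj_size);
    return obj_size;
}
```

A caller might declare unsigned char bytes[sizeof(unsigned int)], copy an unsigned int into it, and then read bytes[0] through bytes[sizeof(unsigned int) - 1]. The order of the bytes depends on the implementation, but copying them back with memcpy() restores the original value.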

wchar_t

Risk Assessment

Understanding how to represent characters and character strings can eliminate many common programming errors that lead to software vulnerabilities.

| Recommendation | Severity | Likelihood | Remediation Cost | Priority | Level |
| --- | --- | --- | --- | --- | --- |
| STR00-C | medium | probable | low | P12 | L1 |

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

Related Guidelines

CERT C++ Secure Coding Standard: STR00-CPP. Represent characters using an appropriate type

ISO/IEC TR 24731-1:2007

ISO/IEC 9899:2011 Section 7.1.1, "Definitions of terms," and Section 7.24, "String handling <string.h>"

Bibliography

[Seacord 2005a] Chapter 2, "Strings"