Null-terminated strings in C

Offline Phil

« October 17, 2019, 01:05:45 PM »
In 1972, Dennis Ritchie defined a string in the C language as an array of bytes (characters) that runs until a byte of value 0 (the null byte) terminates it. Multibyte encodings may (e.g., UTF-8) or may not (e.g., UTF-16) be careful to avoid 0 bytes inside a character, which would prematurely end the string. Arbitrary binary data that may include a 0 byte must be avoided, at least if you are going to use the standard string functions, which look for a null byte to end the string. There is no reason that a string (including the null terminator) has to completely fill the allocated array, but it certainly cannot be any longer.
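To make the last two points concrete, here is a minimal sketch. The helper name unused_bytes is made up for illustration (it is not a standard function); it reports how much of an array is left over after the string and its terminator, and uses memchr() so that it never reads past the array even if no terminator is present.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many bytes of a cap-byte array are NOT used by
 * the string stored in it (content plus terminator).
 * Returns -1 if there is no 0 byte anywhere in the array, i.e. the array
 * does not hold a proper string at all. */
long unused_bytes(const char *buf, size_t cap)
{
    const char *nul = memchr(buf, '\0', cap);  /* search only inside the array */
    if (nul == NULL)
        return -1;
    return (long)(cap - (size_t)(nul - buf) - 1);
}
```

For example, with char greeting[16] = "hello"; the call unused_bytes(greeting, sizeof greeting) gives 10 (16 bytes, less 5 of content and 1 terminator). An embedded 0 byte cuts a string short as far as the library is concerned: strlen("ab\0cd") is 2, even though the literal occupies 6 bytes.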

This brings up the problem of trying to fill a string with more data than was allocated to the array in the first place. Particularly insidious is forgetting to account for the null terminator byte when sizing the array, and ending up writing the null byte one beyond the end of the array, overwriting who knows what. A compiler initializing a string may notice this, but don't count on it! In C, an initializer whose content exactly fills the array (such as char s[5] = "hello";) is legal, and the terminator is silently dropped. You may see something like
Code:
  #define STRINGLEN  100  /* 100 bytes of string content */
  char MyString[STRINGLEN+1];  /* space for terminator */
  strncpy(MyString, SrcString, STRINGLEN);  /* copies at most STRINGLEN bytes; adds no \0 if SrcString is that long or longer */

An even worse problem is using string functions (such as the strcpy() byte copy or the strcat() concatenation) that fail to check whether the array is long enough to hold the desired string (again, including that null terminator). There was really no excuse to define any string functions that fail to know about the array length, but that's what was done. You're taking a long walk off a short pier! The all-too-common result is the bytes of a string being written past the end of the character array, overwriting other data or even code. This has been exploited many times in buffer overflow attacks.
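As a rough sketch of what a length-aware copy could look like, the function below refuses to copy at all unless the content and its terminator both fit. The name copy_if_fits and its return convention are my own assumptions for illustration, not part of the standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst only if it fits in cap bytes.
 * Returns 0 on success, -1 if the copy was refused (dst is left untouched). */
int copy_if_fits(char *dst, size_t cap, const char *src)
{
    size_t need = strlen(src) + 1;   /* content plus the null terminator */
    if (need > cap)
        return -1;                   /* would overflow: do nothing */
    memcpy(dst, src, need);          /* copies the terminator too */
    return 0;
}
```

The caller has to pass the true array size (typically sizeof on the array itself, not on a pointer to it), which is exactly the piece of information strcpy() never asks for.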

In the strncpy() snippet above, note the use of STRINGLEN in the call to try to avoid overflow problems. It's better than nothing, but it can still leave you with an unterminated string (no 0 byte) if SrcString has STRINGLEN or more bytes of data (before its terminating null byte), because strncpy() stops after STRINGLEN bytes without adding a terminator!
Code:
  MyString[STRINGLEN] = '\0';
might be added after the strncpy() to take care of that problem (or anywhere before, on the assumption that it won't be overwritten by a string operation). Note that the index is not STRINGLEN+1, as that would be beyond the end of the array! Even with this fix, a source string longer than STRINGLEN bytes is silently truncated in making this a proper string.
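Putting the strncpy() call and the terminator fix together, a small wrapper might look like the sketch below. The name copy_truncating and the idea of returning a truncation flag are assumptions added for illustration; the destination must have room for maxlen+1 bytes, just like MyString above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copy at most maxlen bytes of content into dst,
 * which must be an array of at least maxlen+1 bytes, and always terminate.
 * Returns 1 if src had to be truncated, 0 if it fit completely. */
int copy_truncating(char *dst, size_t maxlen, const char *src)
{
    strncpy(dst, src, maxlen);   /* may leave dst without a terminator */
    dst[maxlen] = '\0';          /* last valid index of a maxlen+1 byte array */
    return strlen(src) > maxlen;
}
```

With char buf[9]; a call to copy_truncating(buf, 8, "0123456789") leaves "01234567" in buf and returns 1, so the caller at least knows that data was dropped.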

The best solution would be an object to hold the array of bytes, along with the current array length and perhaps the current string length (less the null terminator). However, if you're working in C, you don't have real objects, and at best have to manually cart around the associated lengths and make sure you don't accidentally overwrite them, as well as avoiding the blind use of most string functions in the standard library. Perhaps a pseudo-object can be placed on the heap, with a single pointer to the byte array and associated data (lengths). There could be wrapper functions around all the naïve native string functions that would first check whether there is sufficient array space to hold the end result. There's no harm done in manually tracking the actual length of the string (provided that your wrapper function updates it) and keeping a terminating null for the use of standard library functions.
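Here is one way such a pseudo-object might be sketched. Everything in it (the CheckedStr type and the cs_new, cs_append and cs_free functions) is hypothetical naming of my own, and a real version would need more operations than a single append.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pseudo-object: the byte array plus its bookkeeping. */
typedef struct {
    size_t cap;   /* bytes allocated for data, terminator included */
    size_t len;   /* current content length, terminator excluded */
    char  *data;  /* always kept null-terminated for the standard library */
} CheckedStr;

/* Allocate an empty string with room for cap bytes (cap must be >= 1). */
CheckedStr *cs_new(size_t cap)
{
    if (cap == 0)
        return NULL;
    CheckedStr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->data = malloc(cap);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    s->cap = cap;
    s->len = 0;
    s->data[0] = '\0';
    return s;
}

/* Checked replacement for strcat(): returns 0 on success, or -1 (leaving
 * the string unchanged) if the result would not fit. */
int cs_append(CheckedStr *s, const char *src)
{
    assert(s != NULL && src != NULL);
    size_t add = strlen(src);
    if (add >= s->cap - s->len)              /* need add + 1 bytes free */
        return -1;
    memcpy(s->data + s->len, src, add + 1);  /* terminator included */
    s->len += add;                           /* wrapper keeps len current */
    return 0;
}

void cs_free(CheckedStr *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}
```

Because data always ends in a 0 byte, it can still be handed to read-only library calls such as strcmp() or printf("%s"), while anything that writes goes through the checked wrappers.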

Naturally, introducing additional checking like this will slow down the code, but may be worth it to avoid nasty buffer overflow errors. If you're writing in C in the first place, it's likely for the raw performance needed for real-time data processing (e.g., video conversion), and you can't afford a lot of sanity checking. In that case, it may be a worthwhile tradeoff to develop the code using macros and functions that do a lot of checking and verification, and then (for production) switch to lighter-weight macros that don't do such checking, and hope that your thorough code testing has found all the problem areas! Your debug/development code might even issue a run-time warning if switching to the faster (unchecked) code could produce a buffer overflow.
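A minimal sketch of that development/production switch follows. The macro names STR_COPY and STR_UNCHECKED, the checked_copy function and the overflow_warnings counter are all assumptions for illustration; define STR_UNCHECKED at compile time to get the fast, unchecked version.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#ifdef STR_UNCHECKED
/* Production build: plain strcpy(), no checking at all. */
#define STR_COPY(dst, cap, src)  ((void)(cap), strcpy((dst), (src)))
#else
/* Development build: check first, warn at run time, refuse to overflow. */
static int overflow_warnings = 0;   /* how many copies were refused */

static char *checked_copy(char *dst, size_t cap, const char *src,
                          const char *file, int line)
{
    size_t need = strlen(src) + 1;  /* content plus terminator */
    if (need > cap) {
        fprintf(stderr, "%s:%d: copy of %zu bytes would overflow %zu-byte buffer\n",
                file, line, need, cap);
        overflow_warnings++;
        return NULL;                /* dst is left untouched */
    }
    return strcpy(dst, src);
}
#define STR_COPY(dst, cap, src) \
    checked_copy((dst), (cap), (src), __FILE__, __LINE__)
#endif
```

In the checked build, STR_COPY(buf, sizeof buf, "abcd") on a 4-byte buf prints a warning naming the source file and line and returns NULL; the same call compiled with STR_UNCHECKED would overflow silently, which is the risk you accept in exchange for speed.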