created: 11/14/2007; revised 07/04/2017
This section is about null-terminated strings. Many of the puzzles are similar to the functions in the standard strings library of most C environments.
A null-terminated string is a sequence of bytes in memory where each byte encodes a single character using ASCII. The last character is followed by a byte containing all zero bits. That byte is called a null byte. For example, the string "The game is afoot!" is stored as:
The slash in the last cell represents the null byte. The empty cells represent the space character, represented in ASCII as the pattern 0x20.
Historically, C programs have encoded characters as bit patterns using the ASCII encoding scheme. Although various libraries use other encoding schemes, ASCII remains the most popular and is used in these puzzles.
The following table shows the ASCII encoding scheme.
Each character is encoded as an 8-bit pattern. The table shows the pattern
using the hexadecimal name for the pattern. The column heading gives the 1's
place digit of the hex and the row heading gives the 16's place digit of the
hex. For example, find the character 'K' in the central (white) portion of
the chart. The column heading is B and the row heading is 4 so the bit pattern
is 0x4B, or 0100 1011
.
1's place | ||||||||||||||||
16's place |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
0 | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1 | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2 | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | DEL |
Some characters in the chart are control characters, which
in historic times were used to send commands to electro-mechanical equipment like teletypes
and printers. Control characters do not correspond to printed characters and
are shown in the chart as abbreviations. For example, BEL (0x07
)
commands a device to sound its bell. Modern equipment will sometimes
sound a beep when it is sent BEL.
Line feed is 0x0A
and is still used to signify the end of a line for text contained
in main memory. When text is stored in a disk file, end of lines are signified
by LF (in Unix systems), CR-LF (in Microsoft systems), or CR (older Apple systems).
Character SP (0x20
) is the space character.
Character HT (0x09
) is the horizontal tab. NUL
is the pattern 0x00
— the null byte placed at the end of strings.
Often it is convenient to regard the bit pattern that encodes a character as
encoding an integer. For example, the bit pattern the represents 'A' is
0x41
or 01000001
. This pattern could also be regarded as
the base 2 encoding of an integer. Adding one to this integer yields 01000010
which is the ASCII encoding for 'B'. The following code fragment outputs the
sequence 'A', 'B', 'C', 'D':
int ch = 'A' ; putchar( ch ); putchar( ch+1 ); putchar( ch+2 ); putchar( ch+3 );
In the first statement, the byte encoding character 'A' is put into the low
order byte of the int
variable ch
because putchar()
requires an int
argument.
Often you want to convert a digit character into the corresponding integer. For example, you might want to convert the character '1' into the integer value one. The following does not work:
int value; value = '1';
because it will place the ASCII encoding for character '1' into the low order byte of the variable
0000 0000 0000 0000 0000 0000 0011 0001(assuming ints are 32 bits), rather than the desired pattern for the integer one:
0000 0000 0000 0000 0000 0000 0000 0001
To get the desired pattern into the variable, do this:
int value; value = '1' - '0';
Since the ASCII encoding for '1' immediately follows the encoding for '0', the subtraction evaluates to integer one.
Strings are accessed through pointers. The printf()
function from the standard I/O library prints a string to the monitor. For
example,
printf("The game is afoot!")
prints its string to the monitor. At run time, the string literal "The
game is afoot!"
is a null terminated string placed somewhere in
memory. The actual argument sent to printf()
is a pointer to the
first byte of that string:
The effect is the same as the following:
char *p; p = "The game is afoot!"; printf( p );
In the above code, the variable p
contains a pointer to a char
.
To increment the pointer to point to the next char, add one to it. The following
code:
char *p; p = "The game is afoot!"; p++ ; printf( p );
will print out "he game is afoot!"
.
A string literal is a sequence of characters contained between quote marks, like "The game is afoot!" The compiler allocates memory for it, puts the characters followed by a null byte into the memory, and provides a pointer to the first byte.
Bug Alert! In standard C, string literals are constants. They cannot be altered. The following is illegal:
char *p = "The game is afoot!"; *p = 'X' ;
The fragment is attempting to put 'X' into the first byte of the literal, but changing the literal is illegal. The compiler might not catch the error, so you will have a mysterious run-time error instead.
Unfortunately, many compilers do not enforce this, and the above code will compile and run. Worse, many C programs have been written with the expectation that string literals can be altered.
Often a program will allocate a generous amount of memory for a character array, more bytes than any string is expected to use. The various strings that the program manipulates are stored at the front of the array (and always terminated with a null byte). The array is used as a staging area, and is often called a string buffer. For example the following uses a buffer of 1024 characters which is plenty of room for the strings it is dealing with:
char buffer[1024]; strcpy( buffer, "SHORT String" ); strcat( buffer, " MADE Longer"); strToLower( buffer );
The first function copies characters into the buffer. Although the string literal itself cannot be changed, its characters can be copied into a buffer and the buffer can be changed. The second function appends additional characters to the string in the buffer by copying characters from a second literal. The third function converts characters in the buffer to lower case. The buffer is not a string literal, so the characters it contains can be altered.