Some inputs are from untrustable users, so those inputs must be validated (filtered) before being used. You should determine what is legal and reject anything that does not match that definition. Do not do the reverse (identify what is illegal and reject those cases), because you are likely to forget to handle an important case. Limit the maximum character length (and minimum length if appropriate), and be sure to not lose control when such lengths are exceeded (see the buffer overflow section below for more about this).
For strings, identify the legal characters or legal patterns (e.g., as a regular expression) and reject anything not matching that form. There are special problems when strings contain control characters (especially linefeed or NIL) or shell metacharacters; it is often best to ``escape'' such metacharacters immediately when the input is received so that such characters are not accidentally sent. CERT goes further and recommends escaping all characters that aren't in a list of characters not needing escaping [CERT 1998, CMU 1998]. see the section on ``limit call-outs to valid values'', below, for more information.
Limit all numbers to the minimum (often zero) and maximum allowed values. Filenames should be checked; usually you will want to not include ``..'' (higher directory) as a legal value. In filenames it's best to prohibit any change in directory, e.g., by not including ``/'' in the set of legal characters. A full email address checker is actually quite complicated, because there are legacy formats that greatly complicate validation if you need to support all of them; see mailaddr(7) and IETF RFC 822 [RFC 822] for more information if such checking is necessary.
These tests should usually be centralized in one place so that the validity tests can be easily examined for correctness later.
Make sure that your validity test is actually correct; this is particularly a problem when checking input that will be used by another program (such as a filename, email address, or URL). Often these tests are have subtle errors, producing the so-called ``deputy problem'' (where the checking program makes different assumptions than the program that actually uses the data).
The following subsections discuss different kinds of inputs to a program; note that input includes process state such as environment variables, umask values, and so on. Not all inputs are under the control of an untrusted user, so you need only worry about those inputs that are.
Many programs use the command line as an input interface, accepting input by being passed arguments. A setuid/setgid program has a command line interface provided to it by an untrusted user, so it must defend itself. Users have great control over the command line (through calls such as the execve(3) call). Therefore, setuid/setgid programs must validate the command line inputs and must not trust the name of the program reported by command line argument zero (the user can set it to any value including NULL).
By default, environment variables are inherited from a process' parent. However, when a program executes another one it can set the environment variables to arbitrary values. This is dangerous to setuid/setgid programs, because their invoker can control their environment variables, sending them on. Since they are usually inherited, this also applies transitively.
Environment variables are stored in a format that allows multiple values with the same field (e.g., two SHELL values). While typical command shells prohibit doing this, a cracker can create such a situation; some programs may test one value but use a different one in this case. Even worse, many libraries and programs are controlled by environment variables in ways that are obscure, subtle, or even undocumented. For example, the IFS variable is used by the sh and bash shell to determine which characters separate command line arguments. Since the shell is invoked by several low-level calls, setting IFS to unusual values can subvert apparently-safe calls.
For secure setuid/setgid programs, the short list of environment variables needed as input (if any) should be carefully extracted. Then the entire environment should be erased by setting the global variable environ to NULL, followed by resetting a small set of necessary environment variables to safe values (not values from the user). These values usually include PATH (the list of directories to search for programs, which should not include the current directory), IFS (to its default of `` \t\n''), and TZ (timezone).
A program is passed a set of ``open file descriptors,'' that is, pre-opened files. A setuid/setgid program must deal with the fact that the user gets to select what files are open and to what (within their permission limits). A setuid/setgid program must not assume that opening a new file will always open into a fixed file descriptor id. It must also not assume that standard input, standard output, and standard error refer to a terminal or are even open.
If a program takes directions from a given file, it must not give it special trust unless only a trusted user can control its contents. This means that an untrusted user must not be able to modify the file, its directory, or any of its parent directories. Otherwise, the file must be treated as suspect.
CGI inputs are internally a specified set of environment variables and standard input. These values must be validated.
One additional complication is that many CGI inputs are provided in so-called ``URL-encoded'' format, that is, some values are written in the format %HH where HH is the hexadecimal code for that byte. You or your CGI library must handle these inputs correctly by URL-decoding the input and then checking if the resulting byte value is acceptable. You must correctly handle all values, including problematic values such as %00 (NIL) and %0A (newline). Don't decode inputs more than once, or input such as ``%2500'' will be mishandled (the %25 would be translated to ``%'', and the resulting ``%00'' would be erroneously translated to the NIL character).
CGI scripts are commonly attacked by including special characters in their inputs; see the comments above.
Some HTML forms include client-side checking to prevent some illegal values. This checking can be helpful for the user but is useless for security, because attackers can send such ``illegal'' values directly to the web server. As noted below (in the section on trusting only trustworthy channels), servers must perform all of their own input checking.
Programs must ensure that all inputs are controlled; this is particularly difficult for setuid/setgid programs because they have so many such inputs. Other inputs programs must consider include the current directory, signals, memory maps (mmaps), System V IPC, and the umask (which determines the default permissions of newly-created files). Consider explicitly changing directories (using chdir(2)) to an appropriately fully named directory at program startup.
Place timeouts and load level limits, especially on incoming network data. Otherwise, an attacker might be able to easily cause a denial of service by constantly requesting the service.