Input Filtering, Part 2

Strings and Numbers

This article was first published in the “Tips & Tricks” column in php|architect magazine.

This year has seen an increased focus on PHP security, and this is good for the language, developers, and business community. One phrase that comes to mind when discussing secure coding practices is Chris Shiflett’s mantra of “filter input, escape output.” While we know what this means in a general sense, practical examples elude us. This month’s installment of Tips & Tricks continues the series on filtering input, providing practical examples and helpful tips to filter strings and numbers.

Welcome back to the Tips & Tricks input filtering series. If you’ve been following along, you’ll know that this is the second of three parts on filtering input, and by “input,” I don’t mean only user input from an HTML form. I mean input from any external source, be it GET, POST, cookies, RSS feeds, XML-RPC, or any other place from which an application accepts outside data beyond the control of the programmer. That’s the data that needs filtering.

So, to summarize part one of this series: input should always be considered evil and tainted and, thus, must be filtered, and to properly filter input, a whitelist approach is the most logical solution to ensure that input received is input expected.

Continuing this short review of last month’s column, take a brief look at Listing 1. Without dwelling too much on this code listing, I’d like to point out that the whitelist approach here works merely to ensure that the received data adheres to a strict set of field names. Another form on another site could post all manner of different fields to this form, but the $clean array will only contain the expected and intended fields. By now, it should be clear why a whitelist approach is the most desirable form of filtering data; it requires only the knowledge of what the form should receive—not the myriad data the form could receive.

For now, we’ll skim over the code in Listing 1, but I’ll return to it later to expand on it and enhance the whitelist approach shown to filter data down to specific types.

Listing 1.
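<?php
// Return only the elements of $input whose keys appear in $allowed.
function filter ($input, $allowed) {
    $filtered = array();
    foreach ($input as $key => $value) {
        // Strict comparison, so a numeric key such as 0 cannot
        // loosely match one of the field names.
        if (in_array($key, $allowed, true)) {
            $filtered[$key] = $value;
        }
    }
    return $filtered;
}

// The only field names this form is expected to submit.
$whitelist = array(
    'name',
    'street',
    'city',
    'state',
    'postal_code',
    'phone',
    'email',
);

if (is_array($_POST)) {
    $clean = filter($_POST, $whitelist);
}
?>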
<form method="POST">
Name: <input type="text" name="name" maxlength="50" /><br />
Street: <input type="text" name="street" maxlength="100" /><br />
City: <input type="text" name="city" maxlength="50" /><br />
State:
<select name="state">
<option>Pick a state...</option>
<option>Alabama</option>
<option>Alaska</option>
<option>Arizona</option>
...
</select><br />
Postal Code: <input type="text" name="postal_code" maxlength="5" /><br />
Phone: <input type="text" name="phone" maxlength="25" /><br />
E-mail: <input type="text" name="email" maxlength="255" /><br />
<input type="submit" value="Submit" />
</form>

Checking for Input

I want to take this time to point out a few erroneous practices and assumptions made by developers, especially when checking for the existence of input. I’ll use this discussion as a jumping-off point for the meat of this installment, which is a discussion on filtering for strings and numbers.

Take, for example, the following line of code:

if ($_POST['name']) {

An if statement checks whether an expression evaluates to TRUE or FALSE, and most values (-1, 1, and nearly any other number or non-empty string) evaluate to TRUE, but this is where it gets tricky in the case of the above line.

In PHP, the following values are considered FALSE: the Boolean FALSE itself, the integer zero (0), the float zero (0.0), an empty string, the string “0”, an array with zero elements, and the special type NULL.

With this in mind, consider how this if statement will react when a user enters the number zero (0) as a value in the name field. The if statement will treat it as FALSE, and whatever action the if statement was supposed to take will be bypassed, likely passing control to an else statement. This seems like a simple no-brainer, but I have seen many applications open to public scrutiny using similar lines of code.
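
The behavior described above can be sketched in a few lines. The helper name passes_truthiness_check() and the sample values are hypothetical; the function only mirrors what a bare if ($value) does with a form value.

```php
<?php
// Hypothetical helper: mirrors what `if ($value)` does with a form value.
function passes_truthiness_check($value)
{
    return (bool) $value;
}

// Sample values standing in for $_POST['name'].
foreach (array('Ben', '0', '', ' ') as $sample) {
    echo var_export($sample, true), ' => ';
    echo passes_truthiness_check($sample) ? 'accepted' : 'rejected', "\n";
}
```

Here 'Ben' and the single space are accepted, while both the empty string and the legitimate value '0' are rejected.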

Further still, I have seen many seek to correct this problem by using the following line of code instead:

if (!empty($_POST['name'])) {

However, the result is very similar. The empty() function returns TRUE for more than a truly empty string: it also returns TRUE for the string “0”, the integer zero, the float zero, an empty array, a variable declared without a value (as in a class), NULL, FALSE, and a variable that does not exist at all.

So, again, when a user enters the number zero as a value for name, this line evaluates to FALSE, bypassing the code within the if statement.

Finally, to further reiterate bad practices used in checking for the existence of input, I have seen the use of isset() to check whether an input variable contains any data. Again, this will cause problems, since a field submitted blank holds an empty string and is still considered set.
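
A short sketch makes the two failure modes concrete. The $post array is a hypothetical stand-in for $_POST so that the code runs from the command line; the field names are borrowed from Listing 1.

```php
<?php
// Hypothetical stand-in for $_POST: 'name' holds the string "0",
// 'city' was submitted blank, and 'phone' was never submitted.
$post = array('name' => '0', 'city' => '');

// !empty() rejects the legitimate value '0'.
var_dump(!empty($post['name']));   // bool(false)

// isset() accepts a field that was submitted blank.
var_dump(isset($post['city']));    // bool(true)

// isset() is FALSE only when the field is missing (or NULL).
var_dump(isset($post['phone']));   // bool(false)
```

Neither function alone answers the question “did this field arrive with data in it?”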

So, with three flawed examples for checking the existence of input, what exactly is the best way to check for data?

The best way I have found to check for the existence of data in a variable is to check the length of the string. strlen() returns zero for an empty string and a positive value for any string that contains at least one character, including the string “0”. Note that it is not advisable to use strlen() on anything other than strings, but it works fine to determine whether data exists in input because scalar values received through GET, POST, and cookies always arrive as strings, even if the data contains only numbers. The one exception is a field named with square brackets, such as name[], which PHP delivers as an array.

So, a better way to check for the existence of data in input is to use a line similar to the following:

if (strlen($_POST['name']) > 0) {

Keep in mind, though, that strlen() counts spaces as well, so a “blank” value of one or two spaces will pass this test. Use trim() to ensure this doesn’t happen: strlen(trim($_POST['name'])). Just beware that a non-existent field raises a notice (a warning as of PHP 8) and then evaluates to zero on a strlen() check, so a test such as strlen($_POST['name']) == 0 cannot tell a blank field from one that was never submitted and is not recommended.
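
The checks discussed in this section can be combined into one small function. This is a minimal sketch, not part of the original listings: has_input() is a hypothetical name, and the isset() and is_string() guards are additions that handle missing fields and array-valued fields before strlen() is called.

```php
<?php
// Hypothetical helper, not part of the original listings.
// Returns TRUE only if $key exists in $input, holds a string, and
// that string is non-empty once surrounding whitespace is trimmed.
function has_input(array $input, $key)
{
    if (!isset($input[$key]) || !is_string($input[$key])) {
        return false;
    }
    return strlen(trim($input[$key])) > 0;
}

// Sample data standing in for $_POST.
$post = array('name' => '0', 'city' => '  ', 'tags' => array('a'));

var_dump(has_input($post, 'name'));   // bool(true)
var_dump(has_input($post, 'city'));   // bool(false)
var_dump(has_input($post, 'phone'));  // bool(false)
var_dump(has_input($post, 'tags'));   // bool(false)
```

In an application, the first argument would be $_POST or another input array.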

Promoting a Whitelist Approach

It behooves me to reiterate the importance of a whitelist approach, however. It is quite impossible to tell exactly what data an application will receive over the span of its life. Thus, trying to guess every possible undesirable value is not my idea of fun. In fact, it simply can’t be done.

So, instead of wasting time trying to determine what input should be considered “bad,” think of what input is actually good and acceptable, and check for that. Indeed, you should already know what data is acceptable to your application—you built it, after all.

In the examples listed earlier, the code essentially checks for input that is not acceptable (empty fields, or no data). This is a blacklist approach, and I don’t advocate it. My end suggestion is merely the best way to check for an empty input variable, but it is not the approach I want to promote. Instead, I want to encourage readers to adopt a whitelist approach when checking input—ensure that input received is input expected.

The PHP ctype functions are a step in the right direction and may be the only functions needed to check for expected input in many cases.

Using ctype Functions

The PHP ctype functions have been enabled by default since PHP 4.2.0, and built-in support has been available since version 4.3.0, so there’s no reason not to use them.

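The ctype functions come from the standard C library and check every single character in the string passed to the function. If every character matches the type being checked, then the function returns TRUE. Otherwise, it returns FALSE. Pass strings rather than integers: an integer from -128 to 255 is interpreted as the ASCII value of a single character, not as a string of digits. Input from GET, POST, and cookies already arrives as strings, so this matters mainly for values that have been cast to another type.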

Take, for example, the code snippet shown in Listing 2. First, I initialize the $clean array. Then, I use ctype_alpha() on a username input variable to ensure that the variable contains only alphabetic characters.

Listing 2.
<?php
$clean = array();

if (ctype_alpha($_POST['username'])) {
    $clean['username'] = $_POST['username'];
}
?>

If this function encounters any character other than an uppercase or lowercase alphabetic character, according to the current locale, then it will return FALSE.
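
To see which values pass, the following sketch runs a few hypothetical usernames through ctype_alpha(). The samples use only ASCII characters, so the results do not depend on the locale.

```php
<?php
// Hypothetical sample usernames, ASCII only.
$samples = array('Ben', 'ben42', 'Ben Ramsey', '');

foreach ($samples as $sample) {
    echo var_export($sample, true), ' => ';
    var_dump(ctype_alpha($sample));
}
```

Only 'Ben' passes. The digits in 'ben42' and the space in 'Ben Ramsey' cause a FALSE result, and as of PHP 5.1.0 the empty string returns FALSE as well.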

Recall from Part 1 of this series that it is important to store filtered input to a separate variable from the originating variable, hence the $clean array used in Listing 2. Aside from the fact that this aids the programmer in keeping track of what is clean and what is tainted, this will ensure that absolutely nothing that could be tainted will be used. Everything in the $clean array should be filtered before being added to the array. Do not, under any circumstances, do something similar to the following:

$clean = $_POST;
$clean = filter($clean);

This approach is counterproductive to the filtering process. The nature of the filtering process is to let only the data that is expected pass through. The code above demonstrates a backwards method of filtering—all data passes through first and is later filtered. If $_POST contains a username field with invalid characters and the imaginary filter() function used here doesn’t function properly, then $clean will contain the tainted username value.

Now, ctype_alpha() checks for the presence of alphabetic characters, but what if a number is passed to this function from, for example, the postal_code field in Listing 1? Obviously, it will return FALSE because a number character is definitely not alphabetic. So, for input variables that only contain numeric values, use ctype_digit(), as seen in Listing 3.

Listing 3.
<?php
if (ctype_digit($_POST['postal_code'])) {
    $clean['postal_code'] = $_POST['postal_code'];
}
?>
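
Because ctype_digit() accepts only the characters 0 through 9, it rejects signs and decimal points while leaving leading zeros intact. The sketch below wraps the check in a hypothetical filter_postal_code() function and adds a length test. The five-digit rule is an assumption taken from the maxlength="5" attribute in the Listing 1 form, which only the browser enforces.

```php
<?php
// Hypothetical helper: accepts only a five-digit postal code,
// matching the maxlength="5" hint in the Listing 1 form.
function filter_postal_code($value)
{
    if (is_string($value) && ctype_digit($value) && strlen($value) === 5) {
        return $value;
    }
    return '';
}

var_dump(filter_postal_code('02134'));   // string(5) "02134"
var_dump(filter_postal_code('-2134'));   // string(0) ""
var_dump(filter_postal_code('21.34'));   // string(0) ""
var_dump(filter_postal_code('123456'));  // string(0) ""
```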

Likewise, I may not want to limit the username field mentioned above to purely alphabetic characters. If I want it to accept any alphabetic or numeric characters, I’ll use ctype_alnum() instead. Yet, even this function does not allow for spaces, hyphens, underscores, or punctuation. If I want to accept any printable characters for, say, the name or street fields in Listing 1, then I need another function. Again, ctype provides just the function: ctype_print(). This function returns TRUE only if every character in the string is printable, spaces included. If it encounters any control characters or characters that do not have any control function or output at all, then it returns FALSE.
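
The difference between the two functions shows up quickly with a few hypothetical sample values:

```php
<?php
// Hypothetical sample values. The double-quoted "Main\tSt" contains a tab.
$samples = array('ben42', 'ben_42', '123 Main St.', "Main\tSt", '<b>Ben</b>');

foreach ($samples as $sample) {
    echo var_export($sample, true), "\n";
    echo '  ctype_alnum: ', var_export(ctype_alnum($sample), true), "\n";
    echo '  ctype_print: ', var_export(ctype_print($sample), true), "\n";
}
```

Only 'ben42' passes ctype_alnum(). Every sample except the one containing a tab passes ctype_print(), including the value with HTML tags, so a string that passes ctype_print() still needs to be escaped on output.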

So, now, we are armed with an arsenal of functions that provide an excellent whitelist approach to checking input variables, and we didn’t need to learn any regular expressions. In fact, the ctype functions perform faster than functions that require a regular expression, such as preg_match() or ereg() (the latter was removed in PHP 7), and thus are preferred wherever a ctype function can express the rule. To learn more about the ctype functions, see http://www.php.net/ctype.
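
For comparison, here is a sketch of the same “digits only” rule written both ways. Both function names are hypothetical, and the sketch makes no claim about timing; it only shows that the ctype version needs no pattern at all.

```php
<?php
// Two ways to express "digits only"; both helpers are hypothetical.
function digits_via_ctype($value)
{
    return is_string($value) && ctype_digit($value);
}

function digits_via_regex($value)
{
    // \A and \z anchor to the very start and end of the string, so a
    // trailing newline is not accepted (a plain $ would allow one).
    return is_string($value) && preg_match('/\A[0-9]+\z/', $value) === 1;
}

// Both functions agree on every sample below.
foreach (array('12345', '', "123\n", '12a') as $sample) {
    var_dump(digits_via_ctype($sample) === digits_via_regex($sample));
}
```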

Putting It All Together

Now that we have a good handful of functions to use for applying a whitelist approach to input, let’s revisit that form in Listing 1 and see what can be done to improve upon it.

Remember that the processing code in Listing 1 merely checks to ensure that expected input variables are saved to the $clean array. While this doesn’t seem like much, it is a way of separating tainted data from expected data. Already, the $clean array contains only the variables expected, but that’s not enough because this input is still tainted and may contain unexpected values. Listing 4 uses the ctype functions to improve the whitelist approach so that $clean contains only the expected variables and each variable contains only the expected type of data.

Listing 4.
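<?php
// Keep only whitelisted fields, and blank out any value that
// fails the ctype check for its declared type.
function filter ($input, $allowed) {
    $filtered = array();
    foreach ($input as $key => $value) {
        if (array_key_exists($key, $allowed)) {
            switch ($allowed[$key]) {
                case 'string':
                    // Printable characters only; arrays are rejected.
                    $value = (is_string($value) && ctype_print($value)) ? $value : '';
                    break;
                case 'int':
                    // Digits only; signs and decimal points are rejected.
                    $value = (is_string($value) && ctype_digit($value)) ? $value : '';
                    break;
                default:
                    // Unknown type in the whitelist: fail closed.
                    $value = '';
            }
            $filtered[$key] = $value;
        }
    }
    return $filtered;
}

// Expected field names mapped to their expected types.
$whitelist = array(
    'name' => 'string',
    'street' => 'string',
    'city' => 'string',
    'state' => 'string',
    'postal_code' => 'int',
    'phone' => 'string',
    'email' => 'string',
);

if (is_array($_POST)) {
    $clean = filter($_POST, $whitelist);
}
?>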

In this case, I have modified the $whitelist array to contain not only the names of expected fields but also their types. This is a simplistic approach, however, and so there are only the types of “string” and “int,” but that is enough for now.

Notice how the $clean array will contain an empty value for the input variable if the input fails the ctype test. Likewise, if the variable passed to the ctype function is empty, the function returns FALSE as of PHP 5.1.0 (earlier versions returned TRUE), and, either way, an empty value is saved to the $clean array. This may not be desired in cases where a field should be required, but it is the absolute worst thing that the $clean array might contain. In short, none of the data in $clean is tainted. It is all good and acceptable. If any variable contains an empty value and shouldn’t, then it is possible to build in some form of error checking, but an empty value in the $clean array is the least of our worries.
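
One possible shape for that error checking is sketched below. The missing_fields() function and the $required list are hypothetical; the function assumes it receives an array that has already been filtered, such as $clean from Listing 4.

```php
<?php
// Hypothetical helper: lists required fields that are missing or empty
// in an already-filtered array such as $clean from Listing 4.
function missing_fields(array $clean, array $required)
{
    $missing = array();
    foreach ($required as $field) {
        if (!isset($clean[$field]) || strlen($clean[$field]) === 0) {
            $missing[] = $field;
        }
    }
    return $missing;
}

// Sample filtered data: 'postal_code' failed its ctype check and
// 'email' was never submitted.
$clean    = array('name' => 'Ben', 'postal_code' => '');
$required = array('name', 'postal_code', 'email');

print_r(missing_fields($clean, $required));
```

With the sample data, the function reports postal_code and email, which the application could use to redisplay the form with an error message.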

In the next and final installment of this three-part series on filtering data, I’ll take a look at regular expressions and how they can be utilized to further ensure that input received is input expected—including tips on how to filter phone numbers, e-mail addresses, select lists, and other information.

Until then, keep your input clean and your data keen!