Input Filtering, Part 3

Ensuring Input Received Is Input Expected

This article was first published in the “Tips & Tricks” column in php|architect magazine.

This year has seen an increased focus on PHP security, and this is good for the language, developers, and business community. One phrase that comes to mind when discussing secure coding practices is Chris Shiflett’s mantra of “filter input, escape output.” While we know what this means in a general sense, practical examples elude us. This month’s installment of Tips & Tricks concludes the series on filtering input, providing practical examples and helpful tips to filter input using regular expressions, test for the length of data, and ensure acceptable values.

Part one of this series introduced the need to filter input and explained why all input, whether from a user or an RSS feed, should be considered tainted. I also introduced the whitelist approach as a best practice for filtering input. Part two further explained the whitelist approach, exploring the use of the ctype functions as excellent tools to implement a whitelist-based filter. Recall from parts one and two the HTML form used for discussion. I have included a modified version of this form in Listing 1. For the purposes of the present discussion, I have added the age, color, and username fields. Listing 2 shows the processing form as seen at the end of part two.

Rounding out my three-part series on filtering input, this installment of Tips & Tricks discusses using regular expressions to filter input, testing the length of input, and ensuring that values are acceptable (for example, values from select, radio, or checkbox form fields).

Listing 1.
<form method="POST">
Name: <input type="text" name="name" maxlength="50" /><br />
Street: <input type="text" name="street" maxlength="100" /><br />
City: <input type="text" name="city" maxlength="50" /><br />
State:
<select name="state">
<option>Pick a state...</option>
<option>Alabama</option>
<option>Alaska</option>
<option>Arizona</option>
<!-- ... -->
</select><br />
Postal Code: <input type="text" name="postal_code" maxlength="10" /><br />
Phone: <input type="text" name="phone" maxlength="25" /><br />
E-mail: <input type="text" name="email" maxlength="255" /><br />
Age: <input type="text" name="age" maxlength="3" /><br />
Color:<br />
Blue <input type="checkbox" name="color[]" value="blue" /><br />
Red <input type="checkbox" name="color[]" value="red" /><br />
Green <input type="checkbox" name="color[]" value="green" /><br />
Yellow <input type="checkbox" name="color[]" value="yellow" /><br />
Username: <input type="text" name="username" maxlength="16" /><br />
<input type="submit" value="Submit" />
</form>
Listing 2.
<?php
function filter ($input, $whitelist) {
    $clean = array();
    foreach ($input as $key => $value) {
        if (array_key_exists($key, $whitelist)) {
            switch ($whitelist[$key]) {
                case 'string':
                    $clean[$key] = (ctype_print($value)) ? $value : '';
                    break;
                case 'int':
                    $clean[$key] = (ctype_digit($value)) ? $value : '';
                    break;
            }
        }
    }
    return $clean;
}

$post_whitelist = array(
    'name' => 'string',
    'street' => 'string',
    'city' => 'string',
    'state' => 'string',
    'postal_code' => 'int',
    'phone' => 'string',
    'email' => 'string',
);

if ($_POST) {
    $clean = filter($_POST, $post_whitelist);
}
?>

Filtering with Regular Expressions

In last month’s column, I discussed using PHP’s built-in character type (ctype) functions to filter input. When application design allows, the ctype functions provide a fast and easy-to-use interface to implement a whitelist approach to filtering input. However, application design doesn’t always allow this, and the ctype functions lack flexibility.

For example, ctype_alpha() only checks for alphabetic characters, while ctype_digit() checks for only numeric characters. ctype_alnum() checks for both, but then it doesn’t allow for the presence of spaces, underscores, hyphens, or any other non-alphanumeric characters (nor do the previous two mentioned functions). On the other hand, ctype_print() is too open, allowing all printable characters, and this isn’t always a desired approach.

When you know exactly what characters you want to allow, it’s best to restrict input to those characters—and only those characters. So, ctype_alnum() is good for usernames, and ctype_digit() is good for five-digit U.S. zip codes, but ctype_print() isn’t necessarily good for a first and last name, an e-mail address, or a phone number. Good application design defines what characters these fields should accept; good filtering accepts only these characters.
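
To make the trade-off concrete, here is a rough sketch that runs a few sample values through the ctype functions mentioned above. The sample strings are my own and are only illustrative; the point is that each function is either too strict or too loose for a field such as a name.

```php
<?php
// Sample values are illustrative only; they are not taken from the form.
$samples = array('BenRamsey', 'Ben Ramsey', '12345', '12345-1234', "O'Reilly <b>");

foreach ($samples as $sample) {
    // Each ctype function returns TRUE or FALSE; %d prints it as 1 or 0.
    printf(
        "%-14s alpha:%d digit:%d alnum:%d print:%d\n",
        $sample,
        ctype_alpha($sample),
        ctype_digit($sample),
        ctype_alnum($sample),
        ctype_print($sample)
    );
}
```

A name with a space fails ctype_alpha() and ctype_alnum(), while ctype_print() accepts even the value containing angle brackets.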

Enter PHP’s Perl-Compatible Regular Expression (PCRE) functions. These functions are slower than the ctype functions, but they make up for it with increased flexibility and power. A regular expression can describe nearly any pattern of characters, which makes it well suited to spelling out exactly what a field may contain.

Take, for example, the name field in Listing 1. In Listing 2, I define it as a “string” type and then the filter() function filters it using ctype_print(). The decision to use ctype_print() over ctype_alpha() should be clear: I wanted to allow users to enter a space between their first and last names. However, now users can enter all sorts of random characters, characters that should not be acceptable for a name, so I turn to a regular expression to match a name. First, I come up with the following to replace the ctype_print() function:

$clean[$key] = (preg_match('/^[A-Z ]*$/i', $value)) ? $value : '';

This works well for names such as “Ben Ramsey,” but suppose I want Tim O’Reilly or Tim Berners-Lee to fill out my form; I’ll need to allow more characters. Also, assuming I want to use the “string” type as a general purpose string filter, I’ll want to make the regular expression a bit more liberal—but not too liberal. I’m still in control, so I want to accept only a small range of characters, a range of characters I deem acceptable.

A better, “general purpose” regular expression for matching strings is:

/^[-A-Z0-9\.\'"_ ]*$/i
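
As a quick sketch of the difference between the two patterns, the snippet below tests a handful of names against each. The patterns are copied from the text; the variable names and the sample names (typed with plain ASCII apostrophes) are assumptions for illustration.

```php
<?php
// Patterns copied from the text; variable names and samples are illustrative.
$letters_only = '/^[A-Z ]*$/i';
$general      = '/^[-A-Z0-9\.\'"_ ]*$/i';

$names = array('Ben Ramsey', "Tim O'Reilly", 'Tim Berners-Lee', 'Ben <script>');

foreach ($names as $name) {
    // preg_match() returns 1 on a match and 0 otherwise.
    printf(
        "%-16s letters-only:%d general:%d\n",
        $name,
        preg_match($letters_only, $name),
        preg_match($general, $name)
    );
}
```

Note that both patterns use the * quantifier, so an empty string also matches. A typographic (curly) apostrophe is not in the character class and would be rejected.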

I won’t go into the particular details of how regular expressions work. There are books and Web sites for that, but I will share a few of my preferred regular expressions for filtering standard types of information, such as e-mail addresses, phone numbers, and postal codes.

Looking back at Listing 2, I defined the postal code with the “int” type, which works well in certain circumstances when only the five-digit U.S. zip code is acceptable, but what if I want to accept a zip+4 postal code? These are typically written as “12345-1234,” and will cause ctype_digit() to return FALSE, because of the hyphen. Since the “int” type is useful in other situations (e.g. the age field), I won’t rewrite its definition. Instead, I’ll create a new type for “postal,” and create a regular expression to accept either a five-digit zip code or a zip+4 code (with or without the hyphen).

/^(\d{5})[\-]?(\d{4})?$/
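
The following sketch shows what this pattern accepts and rejects. It also surfaces two edge cases: a dangling hyphen ("12345-") and a trailing newline both pass, because the hyphen and the +4 group are independently optional and $ can match just before a final newline. The stricter variant is my own assumption, not the pattern used in the listings.

```php
<?php
// $postal is copied from the text; $postal_strict is a hypothetical
// tightening that ties the hyphen to the +4 digits and uses \z so the
// match must end at the very end of the string.
$postal        = '/^(\d{5})[\-]?(\d{4})?$/';
$postal_strict = '/^(\d{5})(?:-?(\d{4}))?\z/';

$inputs = array('12345', '12345-1234', '123451234', '1234', '12345-', "12345\n");

foreach ($inputs as $input) {
    // json_encode() makes the trailing newline visible in the output.
    printf(
        "%-14s original:%d strict:%d\n",
        json_encode($input),
        preg_match($postal, $input),
        preg_match($postal_strict, $input)
    );
}
```

Whether the stricter form is worth adopting depends on how the clean value is used later; it is shown here only to make the behavior of the original pattern visible.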

Likewise, the e-mail and phone number fields in Listing 2 are of the “string” type, but I know that there are acceptable patterns I want to match for both of these. Plus, my existing “string” regular expression doesn’t allow the @-symbol, or parentheses. Thus, I create an “email” type and define its regular expression as:

/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i

I also create a “phone” type, giving it the following expression:

/^[\(]?(\d{3})[\)]?[\s]?[\-]?(\d{3})[\s]?[\-]?(\d{4})[\s]?[x]?(\d*)$/

These two regular expressions will match most e-mail addresses or U.S. phone numbers. In fact, the expression used for phone numbers here can extract all the parts of a standard phone number to the matches parameter of preg_match(), if desired.
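
To illustrate the point about the matches parameter, here is a minimal sketch that pulls the parts of a phone number out of the captured groups. The helper name parse_phone() and the array keys are hypothetical; only the pattern comes from the text.

```php
<?php
// Pattern copied from the text.
define('PHONE', '/^[\(]?(\d{3})[\)]?[\s]?[\-]?(\d{3})[\s]?[\-]?(\d{4})[\s]?[x]?(\d*)$/');

// Hypothetical helper: returns the parts of a U.S. phone number,
// or NULL if the input does not match the pattern.
function parse_phone($value)
{
    if (!is_string($value) || !preg_match(PHONE, $value, $matches)) {
        return NULL;
    }
    return array(
        'area'      => $matches[1],
        'exchange'  => $matches[2],
        'line'      => $matches[3],
        'extension' => $matches[4], // empty string when no extension is given
    );
}

print_r(parse_phone('(555) 123-4567 x89'));
var_dump(parse_phone('555-CALL-NOW'));
```

Storing the parts rather than the raw string is one way to normalize the many formats the pattern accepts.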

It should be noted, however, that the e-mail address regular expression used above will not match some addresses considered compliant according to RFC 822 guidelines. Take the following input, for example: “John Doe (home address) <jdoe@example.com>”. According to RFC 822 guidelines, this full string is acceptable, but the e-mail regular expression will reject it. Also, addresses that contain no TLD, such as jdoe@example, are valid RFC 822 addresses, and the expression rejects those as well.
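
A short sketch makes these limits visible. The pattern is the one given above; the sample addresses are my own, chosen to include a display-name form and an address without a TLD.

```php
<?php
// EMAIL pattern copied from the text; the samples are illustrative.
$email = '/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i';

$samples = array(
    'jdoe@example.com',
    'jdoe@mail.example.co.uk',
    'jdoe@example',                 // no TLD
    'John Doe <jdoe@example.com>',  // display-name form
    'jdoe@@example.com',            // malformed
);

foreach ($samples as $sample) {
    $verdict = preg_match($email, $sample) ? 'accepted' : 'rejected';
    printf("%-30s %s\n", $sample, $verdict);
}
```

Only the first two samples are accepted, which is usually what a sign-up form wants even though it is narrower than RFC 822.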

If RFC 822 compliance is necessary, then Listing 3 provides an alternative e-mail address filtering method using the PEAR::Mail package. This can also be accomplished using imap_rfc822_parse_adrlist() if PHP is compiled --with-imap. If portability is a concern, however, I suggest using the PEAR::Mail package.

Listing 3.
<?php
require_once 'PEAR.php';
require_once 'Mail/RFC822.php';

$parsed_email = Mail_RFC822::parseAddressList($_POST['email']);
if (!PEAR::isError($parsed_email)) {
    $clean['email'] = $_POST['email'];
}
?>

Testing Input Length

In part one of this series, I mentioned that, while the maxlength attribute of the HTML input tag controls how much data a user may enter when properly using a form located on the host site, it does not restrict the amount of data that a user may post when using a form located on another Web site, or when posting by some other means (see part one for more information).

Likewise, client-side validation with JavaScript may provide good measure for practicing “defense in depth,” as well as a potentially better user experience, but it will not restrict the actual data that can be sent to the form processing script from somewhere else (e.g. another form on another Web site). Thus, it is necessary to perform all input filtering, or validation, on the server side, in addition to any client-side validation.

Regardless of whether you filter input at the client, you must always filter input at the server.

I have seen many sites that provide a maxlength attribute in their input tags but fail to test the length of the field from the server side. This leaves the processing script open to receive all lengths of data, which can lead to database constraint violation errors and, potentially, more dangerous issues.

Checking the length of input, however, is simple, and, coupled with the maxlength attribute, it is easy to determine that a user is abusing the form if input received is longer than the expected length.

Listing 4 is a finalized version of the filter() function that incorporates all that I have discussed thus far. Notice how I have expanded $post_whitelist to include more information about each form field. Now, I associate an array with each field that defines the type of input to check against, in addition to several other details. One of those details is maxlength, which I check in the filter() function with:

if (isset($whitelist[$key]['maxlength'])
    && (strlen($value) > $whitelist[$key]['maxlength'])) {
    continue;
}

Here, I use the continue statement to skip to the next item in the foreach loop, essentially excluding this value from the $clean array if it contains more data than expected. Since I have maxlength defined for these fields in my form, I am confident that no user using my form is able to enter more data than expected. If the input contains values that are longer than their respective maxlength, then I can assume that the user is abusing my form in some way, and I can safely exclude the input from the $clean array.

Ensuring Acceptable Values

In much the same way that maxlength cannot be relied upon to stop would-be attackers from sending unlimited amounts of data to form processing scripts, the values displayed in HTML select, radio button, and checkbox lists are not the only values that can be posted. Thus, it is necessary to filter the values of these fields and ensure that the input received is input expected. Again, this is not a hard practice to implement, but it does require more code.

Take another look at Listing 4. In $post_whitelist, I’ve also added the “option” type, and for each item specified as type “option,” I have listed the expected options in the “options” array. For flexibility, I’ve also added the “multiselect” parameter that is defined on fields in which more than one item may be selected (i.e. checkboxes or menu lists).

In the filter() function, under the “option” case of the switch statement, I check whether the input received is an array. If it is, then I further check to ensure that I’m allowing the user to select more than one item. If not, then the input received shouldn’t be an array, and I discard the data and move on. If it is a multi-select field, then I check to ensure that every item in the array matches those defined in the “options” parameter for the field.

If it’s not an array, then I simply check to ensure that it matches one of the “options.” If it does, then I keep it; if not, then it is discarded.
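
The logic of the last two paragraphs can be sketched as a small standalone helper. The function name filter_option() and its signature are hypothetical, and passing TRUE as the third argument to in_array() (strict comparison) is my own choice rather than something Listing 4 does. As in Listing 4, unexpected items in a multi-select array are dropped while the expected ones are kept.

```php
<?php
// Hypothetical helper illustrating the "option" check described above.
function filter_option($value, array $options, $multiselect = FALSE)
{
    if (is_array($value)) {
        if (!$multiselect) {
            return NULL; // an array was not expected for this field
        }
        // Keep only the submitted items that appear in the whitelist.
        $kept = array();
        foreach ($value as $item) {
            if (in_array($item, $options, TRUE)) {
                $kept[] = $item;
            }
        }
        return $kept;
    }
    // Single value: keep it only if it is one of the expected options.
    return in_array($value, $options, TRUE) ? $value : NULL;
}

$colors = array('blue', 'red', 'green', 'yellow');
$states = array('Alabama', 'Alaska', 'Arizona');

var_dump(filter_option(array('blue', 'purple'), $colors, TRUE)); // keeps only 'blue'
var_dump(filter_option('Alaska', $states));                      // kept as-is
var_dump(filter_option(array('Alaska'), $states));               // NULL: array not allowed
```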

If a value is not acceptable (that is, it doesn’t conform to expectations), then I don’t keep it; it doesn’t get added to the $clean array. Notice how every value in Listing 4 is now set to NULL if it doesn’t conform to expectations. After the switch statement, I check whether the value is NULL, and if it is, I don’t save it to $clean. Recall that in part two of this series I did save it to the $clean array, with an empty value. I no longer do that and instead choose to discard the reference to the field completely. Now, the worst thing that can happen when working with user input is that a field doesn’t exist, but that’s easy to check and report.
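
Checking for fields that did not survive filtering might look something like the sketch below. The helper missing_fields(), the list of required fields, and the sample $clean array are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: list the required fields that are absent from $clean.
function missing_fields(array $clean, array $required)
{
    // array_diff() preserves keys, so re-index the result.
    return array_values(array_diff($required, array_keys($clean)));
}

// Illustrative data: pretend filter() kept only these two fields.
$clean    = array('name' => 'Ben Ramsey', 'age' => '30');
$required = array('name', 'email', 'age');

$missing = missing_fields($clean, $required);
if ($missing) {
    echo 'Missing or invalid: ' . implode(', ', $missing) . "\n";
}
```

Because rejected values never reach $clean, a missing key covers both “not submitted” and “submitted but invalid,” so a single check is enough to report the problem.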

Moving Right Along

Over the past three issues, I have given an in-depth look at input filtering in PHP. This discussion has covered such topics as “why to filter,” “using ctype functions and regular expressions,” and “validating the length and acceptable values of received input.” Throughout, I have promoted a whitelist approach to ensure that input received is input expected.

Until next time, happy coding!

Listing 4.
<?php
define('STRING', '/^[-A-Z0-9\.\'"_ ]*$/i');
define('EMAIL', '/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i');
define('PHONE', '/^[\(]?(\d{3})[\)]?[\s]?[\-]?(\d{3})[\s]?[\-]?(\d{4})[\s]?[x]?(\d*)$/');
define('POSTAL_US', '/^(\d{5})[\-]?(\d{4})?$/');

$post_whitelist = array(
    'name' => array(
        'type' => 'string',
        'maxlength' => 50,
    ),
    'street' => array(
        'type' => 'string',
        'maxlength' => 100,
    ),
    'city' => array(
        'type' => 'string',
        'maxlength' => 50,
    ),
    'state' => array(
        'type' => 'option',
        'options' => array(
            'Alabama',
            'Alaska',
            'Arizona',
        ),
    ),
    'postal_code' => array(
        'type' => 'postal',
        'maxlength' => 10,
    ),
    'phone' => array(
        'type' => 'phone',
        'maxlength' => 25,
    ),
    'email' => array(
        'type' => 'email',
        'maxlength' => 255,
    ),
    'age' => array(
        'type' => 'int',
        'maxlength' => 3,
    ),
    'color' => array(
        'type' => 'option',
        'options' => array(
            'blue',
            'red',
            'green',
            'yellow',
        ),
        'multiselect' => TRUE,
    ),
    'username' => array(
        'type' => 'username',
        'maxlength' => 16,
    ),
);

if ($_POST) {
    $clean = filter($_POST, $post_whitelist);
}

function filter($input, $whitelist)
{
    $clean = array();
    foreach ($input as $key => $value) {
        if (array_key_exists($key, $whitelist)) {
            $filtered = NULL;
            // Only "option" fields may arrive as arrays; discard any other
            // array input before it reaches strlen() or preg_match().
            if (is_array($value) && $whitelist[$key]['type'] != 'option') {
                continue;
            }
            if (isset($whitelist[$key]['maxlength']) && (strlen($value) > $whitelist[$key]['maxlength'])) {
                continue;
            }
            switch ($whitelist[$key]['type']) {
                case 'string':
                    $filtered = (preg_match(STRING, $value)) ? $value : NULL;
                    break;
                case 'int':
                    $filtered = (ctype_digit($value)) ? $value : NULL;
                    break;
                case 'option':
                    if (is_array($value)) {
                        // !empty() avoids a notice when 'multiselect' is not defined.
                        if (!empty($whitelist[$key]['multiselect'])) {
                            $filtered = array();
                            foreach ($value as $option) {
                                if (in_array($option, $whitelist[$key]['options'])) {
                                    $filtered[] = $option;
                                }
                            }
                        }
                    } else {
                        $filtered = in_array($value, $whitelist[$key]['options']) ? $value : NULL;
                    }
                    break;
                case 'username':
                    $filtered = (ctype_alnum($value)) ? $value : NULL;
                    break;
                case 'email':
                    $filtered = (preg_match(EMAIL, $value)) ? $value : NULL;
                    break;
                case 'phone':
                    $filtered = (preg_match(PHONE, $value)) ? $value : NULL;
                    break;
                case 'postal':
                    $filtered = (preg_match(POSTAL_US, $value)) ? $value : NULL;
                    break;
            }
            if (!is_null($filtered)) {
                $clean[$key] = $filtered;
            }
        }
    }
    return $clean;
}
?>