Input Filtering, Part 1
Why Filter?
This year has seen an increased focus on PHP security, and this is good for the language, developers, and business community. One phrase that comes to mind when discussing secure coding practices is Chris Shiflett’s mantra of “filter input, escape output.” While we know what this means in a general sense, practical examples elude us, so for the next three months, Tips & Tricks will give practical suggestions for input filtering, chock full of code examples.
Filter input. What does that mean? Well, in short, it means what it says, but there’s something deeper hidden behind these words, something sinister. Yes, these words mean user input cannot be trusted. For that matter, no input, regardless of its source—forms, RSS feeds, cookies, etc.—is trustworthy. In fact, the level of distrust in input must be so high that you no longer accept anything from these sources at face value. Always verify the input data to ensure it’s the expected, genuine article.
But why is this so hard to do? Is it because we innately want to trust people and other sources? Heavens, no! It’s hard because programmers are naturally lazy.
Filtering input means writing more code, writing smarter code. For those who wish to finish a project quickly, this is daunting, and so they quickly scribble down some code—if, in fact, code can be scribbled—and deploy a release hoping to catch the problems in later bugfix (sometimes called security) releases. This can, however, cause great problems in the meantime, not the least of which could consist of SQL injection or cross-site scripting (XSS)…or just plain bad data.
Ensuring against bad data through filtering input is what we’ll focus on over the next three installments of Tips & Tricks. So, come along with me, and before we’re finished, you’ll be cynical and distrustful with the best of them—no longer able to trust input of any kind—and, thus, security-conscious.
Why Filter Input?
Input is bad. In fact, it’s evil. Just get that through your head, and you’ll be off to a great start.
Input is evil because its source cannot be trusted and the type of data expected is not always the type received, and all the client-side validation scripts in the world can’t stop input from arriving, completely unvalidated, from another source.
What do I mean by “another source?” I mean: another form on another Web site that makes use of your form (often referred to as a spoofed form) for some insidious means—or someone or some script posting by any number of alternative means.
Let’s take, for example, the form in Listing 1, which is located at the imaginary URL http://example.net/form.html. (We’ll continue to come back to this form during the next few months; don’t worry—the code will be included in each column.) Now, this is a form we’ve all seen; it asks for a name and contact information—no doubt, you’ve used a similar form in the past, and there’s nothing wrong with this form, but there are a few assumptions often made about it.
<!-- A form located at: http://example.net/form.html -->
<form method="POST" action="process_form.php">
Name: <input type="text" name="name" maxlength="50" /><br />
Street: <input type="text" name="street" maxlength="100" /><br />
City: <input type="text" name="city" maxlength="50" /><br />
State:
<select name="state">
<option>Pick a state...</option>
<option>Alabama</option>
<option>Alaska</option>
<option>Arizona</option>
...
</select><br />
Postal Code: <input type="text" name="postal" maxlength="5" /><br />
Phone: <input type="text" name="phone" maxlength="25" /><br />
E-mail: <input type="text" name="email" maxlength="255" /><br />
<input type="submit" value="Submit" />
</form>
One assumption is that the maxlength attribute of the fields prevents a user from entering more text than allowed. This is wrong. While a Web browser can correctly prevent a user from doing so through this particular form, there’s nothing to stop the re-creation of this form on another server and using it to submit a much longer string of data.
Another assumption is that the user may pick states only from among the options listed in the state drop-down field. Again, this is wrong and for the same reasons. The Web browser might prevent said user from entering other values when using this form, but if recreated, the sky’s the limit.
We’re starting to see a pattern emerge. A Web form/application is safe only when used properly. This is obvious. But if used improperly, then processing scripts can receive any and all kinds of input.
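To make the pattern concrete, here is a minimal sketch of re-checking on the server what the form only suggests in the browser. The helper name check_limits(), the $max_lengths map, and the three-entry state list are assumptions made for illustration; the state list in Listing 1 is elided, so only the states visible there appear here. Note that strlen() counts bytes, which is good enough for a sketch.

```php
<?php
// Hypothetical helper: reports which fields break the limits that form.html
// only enforces in the browser. The limits mirror the maxlength values in Listing 1.
function check_limits(array $input, array $max_lengths, array $states)
{
    $problems = array();
    foreach ($max_lengths as $field => $max) {
        if (isset($input[$field])
            && (!is_string($input[$field]) || strlen($input[$field]) > $max)) {
            $problems[] = $field;
        }
    }
    // The select field must carry one of the options we actually offered.
    if (isset($input['state']) && !in_array($input['state'], $states, true)) {
        $problems[] = 'state';
    }
    return $problems;
}

$max_lengths = array(
    'name' => 50, 'street' => 100, 'city' => 50,
    'postal' => 5, 'phone' => 25, 'email' => 255,
);
$states = array('Alabama', 'Alaska', 'Arizona'); // only the states shown in Listing 1

// Values a recreated form could post without any browser-side limits.
$spoofed = array(
    'name'   => 'Frodo Baggins',
    'state'  => 'The Shire',
    'postal' => 'It is my precious!',
);
$problems = check_limits($spoofed, $max_lengths, $states);
```

With this data, $problems names the postal and state fields: the first is longer than five characters, and the second was never one of the options.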
Still, let’s look at two more assumptions about this form—just for the heck of it.
This form has a set number of fields. Does that mean these are the only fields that can be submitted? No! Also, can we assume that the processing script (process_form.php in this case) can only receive submissions from this form? The answer, again, is no.
The form in Listing 2 illustrates why these assumptions are wrong. This form lives on another server—for example, at http://evil.example.net/form-spoof.html.
<form method="POST" action="http://example.net/process_form.php">
<input type="hidden" name="name" value="Frodo Baggins" />
<input type="hidden" name="state" value="The Shire" />
<input type="hidden" name="postal" value="It is my precious!" />
<input type="hidden" name="junk" value="Junk data being passed" />
<input type="submit" value="Submit" />
</form>
The first thing to notice about this form is that there are no maxlength attributes. Well, for one, these are hidden fields that don’t use the maxlength attribute, but that’s not important. The fields don’t have to be hidden, and, either way, a devious miscreant may enter as much data as he pleases. Secondly, the state field now has a value of “The Shire.” Wait a minute…that wasn’t in our option list, but it doesn’t matter because it’ll post just fine. Thirdly, this form includes a new field: the junk field. This doesn’t do much now, but consider a server where register_globals is enabled and variables aren’t initialized; think about what it can do.
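Here is a rough sketch of what an extra field can do in that situation. Since register_globals was removed in PHP 5.4, extract() stands in for it here; the function names, the is_admin field, and the password value are all hypothetical and exist only for illustration.

```php
<?php
// extract() emulates register_globals: every submitted field becomes a
// variable before the script's own logic runs.
function is_admin_unsafe(array $post)
{
    extract($post);
    if (isset($password) && $password === 'hypothetical-secret') {
        $is_admin = true;
    }
    // $is_admin was never initialized, so a submitted "is_admin" field survives.
    return !empty($is_admin);
}

function is_admin_safe(array $post)
{
    extract($post);
    $is_admin = false; // initializing the variable overwrites any injected value
    if (isset($password) && $password === 'hypothetical-secret') {
        $is_admin = true;
    }
    return $is_admin;
}

// A spoofed form adds a field the real form never had.
$spoofed = array('name' => 'Frodo Baggins', 'is_admin' => '1');
$unsafe = is_admin_unsafe($spoofed);
$safe   = is_admin_safe($spoofed);
```

The unsafe version grants access on the strength of the extra field alone, while the version that initializes its variable does not.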
The Referer Question
Invariably, the question now arises: But what about the Referer? Yes, what about it? I can check it, right? Sure, go ahead, but it’ll bite you in the end.
It is a common misconception that every request includes a Referer header and that the value of this header always represents the origin of the request. In truth and practice, the origin of the request is always the client. The client can be a Web browser or it can be a script that resides on a server, somewhere. It may or may not choose to include a Referer header in requests. The Referer, when included, may or may not indicate the previously requested parent resource. In fact, some proxy servers have been known to modify or drop the Referer header altogether, thus blocking entire offices and even ISPs from viewing Web sites programmed to check for it.
All this amounts to the fact that Referer is highly unreliable as a means of protecting Web applications from outside posting. Furthermore, it is not as important to ensure input comes from a specific place as it is that the input received conforms to expectations.
Nevertheless, we’ll take a look at how scripts use Referer to block requests from other sites:
if (strcmp($_SERVER['HTTP_REFERER'], 'http://example.net/form.html') == 0) {
    // It came from the right place, so let's process it
}
Now, this snippet of code will properly thwart a form such as the one in Listing 2 from posting to process_form.php, so long as the client includes a Referer header that doesn’t match, but mischievous users aren’t in the business of being foiled by clients. Let’s consider another means of posting and take a look at Listing 3.
<?php
// Using PEAR::HTTP_Request
require_once 'HTTP/Request.php';
$req = new HTTP_Request('http://example.net/process_form.php');
$req->setMethod(HTTP_REQUEST_METHOD_POST);
$req->addHeader('Referer', 'http://example.net/form.html');
$req->addPostData('name', 'Gandalf the Grey');
$req->addPostData('state', 'Middle-earth');
$req->addPostData('email', 'Olorin I was in my youth');
$response = $req->sendRequest();
?>
The code in Listing 3 is similar to that found in Listing 2 in that it posts to process_form.php from a different location and bypasses all the local constraints placed on it (e.g. maxlength and any client-side scripting). However, Listing 3 is different because it doesn’t rely on a Web browser and, thus, can modify any part of the request. In this case, PEAR::HTTP_Request generates a valid POST request, while adding a Referer header. Thus, the script successfully posts to process_form.php because it sends a valid Referer header with a value that process_form.php expects.
Now You’re Getting It
And so, we must filter the input. It’s that simple. We cannot be sure the input comes from the proper location, nor are we sure it is exactly what we want. In fact, we’re pretty sure it’s not.
Feeling distrustful yet? Good. Great, even. Do not trust input from users, from anywhere. This is why it’s important to ensure that input received is input expected.
The approach we’ll take to filter input is often called a “whitelist” approach (as opposed to a “blacklist” approach). Instead of using a blacklist to tell our script what kind of input we won’t allow (e.g. input coming from somewhere other than form.html, as in the Referer example), we’ll use a whitelist to tell it exactly what to allow.
This is actually a much simpler approach because, now, we don’t have to think of the myriad kinds of data an attacker might try to submit to our script. Instead, we need only know what we want to receive and ensure that the received input matches up.
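As a small illustration of the difference, consider the postal field. The two function names below are hypothetical, and the five-digit rule is an assumption based on the maxlength of 5 in Listing 1; this is a sketch, not a complete postal-code validator.

```php
<?php
// Blacklist: strip the characters we happened to think of and accept the rest.
function postal_blacklist($value)
{
    return str_replace(array('<', '>', '"', "'"), '', $value);
}

// Whitelist: accept only what matches the expected shape (assumed here to be
// exactly five digits); anything else is rejected outright.
function postal_whitelist($value)
{
    if (is_string($value) && preg_match('/^[0-9]{5}\z/', $value) === 1) {
        return $value;
    }
    return null;
}

$from_blacklist = postal_blacklist('It is my precious!'); // passes through untouched
$from_whitelist = postal_whitelist('It is my precious!'); // rejected
```

The blacklist has to anticipate every bad value, and this one slips straight past it. The whitelist only has to describe the good one.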
Capturing and Taming Input
Now, let’s talk about capturing some of this evil input.
There are a few places we’ll consider looking for input: $_GET, $_POST, and $_COOKIE. We’ll not look in $_REQUEST, though it does contain the values from each of these superglobal arrays. In short, we want to know the exact scope of the input, so we’ll use the specific superglobal for the location we expect to find it. For example, $_REQUEST['name'] could refer to $_GET['name'], $_POST['name'], or even $_COOKIE['name'], so we want to be sure it’s coming from the correct location, which is POST in this case.
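The ambiguity can be sketched with plain arrays standing in for the superglobals, so the example runs outside a Web request. The array_merge() call is only an approximation of how $_REQUEST is assembled; the real order and contents depend on the request_order and variables_order settings, and the GET, POST, cookie order used here is an assumption.

```php
<?php
// Stand-ins for $_GET, $_POST, and $_COOKIE.
$get    = array('name' => 'value from the query string');
$post   = array('name' => 'value from the form body');
$cookie = array('name' => 'value from a cookie');

// Approximates $_REQUEST for an assumed GET, POST, cookie order:
// later sources overwrite earlier ones.
$request = array_merge($get, $post, $cookie);

// The merged array says nothing about where the value came from...
$ambiguous = $request['name'];

// ...whereas reading the specific source makes the expectation explicit.
$name = isset($post['name']) ? $post['name'] : null;
```

Under the assumed order, the cookie value wins in the merged array even though the script expected a form field.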
Luckily for us, PHP has already done the work of capturing the input. In process_form.php, the values passed by the input form, form.html (or by wherever the request was actually submitted from), are in $_POST. But the data in $_POST, you’ll remember, is still evil data. We must first filter it.
There’s more than one way, however, to filter form input, and I won’t pretend that my suggestions are any more than what they are: suggestions. They are not the right way, but they are a way, and these tips are sure to help control input and provide a foundation on which to build. What’s important is to write code with a security-conscious mindset, and part of that mindset includes being wary of input.
Now, to keep track of our good data, we’ll store everything that’s considered clean (as in: it conforms to expectations) to the aptly named $clean array, which will somewhat mimic everything that’s in $_POST, without all the evil tendencies.
One approach that I often see is a sanitizing function that gets applied to the $_POST array, as seen in Listing 4. While this type of approach removes harmful characters, it does not provide a whitelist solution. Instead, it blacklists potentially harmful characters (control characters) and escapes the input (with htmlentities()), which is not a part of the filtering process. We’re only concerned with filtering the input at this point, so we want the raw data: filtered, but raw. Escaping will take place during the output stage, which isn’t covered here.
<?php
// A blacklist-style sanitizer: strips control characters, then escapes.
function sanitize(&$value, $key)
{
    if (!ctype_print($value)) {
        $value = preg_replace('/[[:cntrl:]]/', '', $value);
    }
    $value = htmlentities($value);
}
$clean = $_POST;
array_walk($clean, 'sanitize');
?>
A whitelist approach defines the valid range of characters/numbers, the acceptable values (of a select field, for example), and the allowed fields. For now, let’s take a look at defining the allowed fields to ensure we receive and process nothing more than expected.
Listing 5 gives a whitelist example for defining the allowed fields. First, we use the $white_list array to define the allowed fields. Then, we run the $_POST array through the filter() function using $white_list as a model. What’s returned to the $clean array is the expected input. Anything unexpected is left behind in $_POST, where it remains safely excluded from the rest of the script.
<?php
// Keep only the fields named in $allowed; everything else is ignored.
function filter($input, $allowed) {
    $validated = array();
    foreach ($input as $key => $value) {
        // Strict comparison, so a numeric key cannot loosely match a field name
        if (in_array($key, $allowed, true)) {
            $validated[$key] = $value;
        }
    }
    return $validated;
}
$white_list = array(
    'name',
    'street',
    'city',
    'state',
    'postal',
    'phone',
    'email',
);
$clean = filter($_POST, $white_list);
?>
This is a very simple approach that does not include any further input checking, at least for now. Still, I hope it is evident how this approach adds a level of flexibility to the filtering process. For example, imagine a $post_white_list, $get_white_list, or even $rss_white_list. Now, it becomes clear that this simple example can expand to filter anything:
$post_clean = filter($_POST, $post_white_list);
$get_clean = filter($_GET, $get_white_list);
$rss_clean = filter($rss, $rss_white_list);
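To see the field whitelist at work, here is a sketch that feeds it the values posted by the spoofed form in Listing 2. The filter_keys() function is a hypothetical variant of filter() built on PHP’s array_intersect_key() and array_flip(); it is one possible way to get the same result, not a replacement for the listing.

```php
<?php
// Same idea as filter() in Listing 5, expressed with built-in array functions:
// keep only the entries whose keys appear in $allowed.
function filter_keys(array $input, array $allowed)
{
    return array_intersect_key($input, array_flip($allowed));
}

$white_list = array('name', 'street', 'city', 'state', 'postal', 'phone', 'email');

// The fields posted by the spoofed form in Listing 2.
$spoofed = array(
    'name'   => 'Frodo Baggins',
    'state'  => 'The Shire',
    'postal' => 'It is my precious!',
    'junk'   => 'Junk data being passed',
);

$clean = filter_keys($spoofed, $white_list);
```

The junk field never reaches $clean. The values of the remaining fields are still unchecked, of course, which is exactly the gap the next step in filtering has to close.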
In next month’s column, I’ll revisit this same code and discuss strategies for defining the data type for each field.
Wrap Up
By now, you should be fully convinced that all input is evil, and you should understand why it’s important to filter all incoming data. When it comes to input, there are no guarantees as to the origin of the data or the type received. Whether working with GET, POST, cookies, RSS feeds, or the like, always filter input, regardless of its source.
Tune in next month when we’ll wrestle more input to ensure input received is input expected.