Whether you're a developer handling multilingual text or an everyday user dealing with garbled characters, understanding Unicode encoding is the key to solving the problem. The Unicode Converter Tool helps you quickly convert between encodings, but mastering the underlying principles lets you tackle any situation.
1. Unicode Fundamentals
1.1 What is Unicode
Unicode is a character encoding standard designed to cover all the world's writing systems. It assigns a unique number (called a "code point") to each character, written as U+ followed by four to six hexadecimal digits, for example U+0041 for "A".
As of Unicode 15.1, over 149,000 characters have been defined, covering 150+ scripts, thousands of symbols, and Emoji. Code points range from U+0000 to U+10FFFF, for a total of 1,114,112 possible code points.
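As a small illustration of the code point notation, the sketch below uses Python's built-in `ord` and `chr`; the helper name `format_code_point` is made up for this example.

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # At least four hex digits; supplementary characters need five or six.
    return f"U+{ord(ch):04X}"


print(format_code_point("A"))   # U+0041
print(format_code_point("中"))  # U+4E2D
print(format_code_point("😊"))  # U+1F60A

# chr() goes the other way: code point number to character.
print(chr(0x4E2D))              # 中

# U+0000 through U+10FFFF inclusive gives the total number of code points.
print(0x10FFFF + 1)             # 1114112
```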
1.2 Unicode vs Character Set vs Encoding
- Character Set: A collection of characters, e.g., ASCII contains 128 characters
- Encoding Scheme: Rules for converting characters to binary
- Unicode: Both a character set (defines each character's number) and a family of encoding schemes (UTF-8, UTF-16, UTF-32)
1.3 Code Planes
Unicode divides the code point space into 17 planes, each with 65,536 code points:
| Plane | Range | Name | Contents |
|---|---|---|---|
| 0 | U+0000 - U+FFFF | BMP (Basic Multilingual Plane) | Common characters: Latin, CJK, symbols |
| 1 | U+10000 - U+1FFFF | SMP (Supplementary Multilingual Plane) | Emoji, historic scripts, music symbols |
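| 2 | U+20000 - U+2FFFF | SIP (Supplementary Ideographic Plane) | Extended CJK Unified Ideographs |
| 3 | U+30000 - U+3FFFF | TIP (Tertiary Ideographic Plane) | Further CJK extensions |
| 4-13 | U+40000 - U+DFFFF | Unassigned | Future expansion |
| 14 | U+E0000 - U+EFFFF | SSP (Supplementary Special-purpose Plane) | Tag characters, supplementary variation selectors |
| 15-16 | U+F0000 - U+10FFFF | Supplementary Private Use Area A/B | User-defined characters |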
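Because each plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane (0-16) that a single character belongs to."""
    return ord(ch) >> 16  # same as ord(ch) // 0x10000


print(plane_of("A"))            # 0  (BMP)
print(plane_of("中"))           # 0  (BMP)
print(plane_of("😊"))           # 1  (SMP)
print(plane_of(chr(0x20000)))   # 2  (first code point of plane 2)
print(plane_of(chr(0x10FFFF)))  # 16 (last code point overall)
```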
2. UTF-8 Encoding in Detail
UTF-8 is the dominant encoding on the internet today, used by over 98% of web pages worldwide.
2.1 Encoding Rules
UTF-8 uses a variable-length 1-4 byte encoding:
| Unicode Range | UTF-8 Byte Pattern | Example |
|---|---|---|
| U+0000 - U+007F | 0xxxxxxx (1 byte) | A → 41 |
| U+0080 - U+07FF | 110xxxxx 10xxxxxx (2 bytes) | é → C3 A9 |
| U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx (3 bytes) | 中 → E4 B8 AD |
| U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes) | 😊 → F0 9F 98 8A |
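The byte patterns in the table can be turned into code fairly directly. The following is an illustrative sketch only; in practice `str.encode("utf-8")` does this for you, and `utf8_encode` is a name invented for the example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by following the bit patterns by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "Aé中😊":
    manual = utf8_encode(ord(ch))
    # The hand-rolled result should match Python's built-in encoder.
    assert manual == ch.encode("utf-8")
    print(ch, manual.hex(" ").upper())
```

The loop prints `41`, `C3 A9`, `E4 B8 AD`, and `F0 9F 98 8A`, matching the example column of the table.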
2.2 Advantages of UTF-8
- ASCII-compatible: Pure ASCII text is identical in UTF-8 — zero migration cost
- Self-synchronizing: Can find character boundaries from any byte position
- Compact: Very space-efficient for English and Latin text
- No byte order issues: UTF-8 is defined as a sequence of single bytes, so there is no endianness to signal and no BOM is required
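The self-synchronizing property follows from the bit patterns: continuation bytes always look like 10xxxxxx, so from any position you can step backwards until you reach a byte that is not a continuation byte. A minimal sketch, assuming valid UTF-8 input (`char_start` is a hypothetical helper):

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the character that covers pos."""
    # Continuation bytes have the top two bits set to 10.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "aé中😊".encode("utf-8")
# Byte layout: a = [0], é = [1..2], 中 = [3..5], 😊 = [6..9]
print(char_start(data, 0))  # 0
print(char_start(data, 2))  # 1
print(char_start(data, 5))  # 3
print(char_start(data, 8))  # 6
```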
3. UTF-16 Encoding in Detail
UTF-16 is another common Unicode encoding, used internally by the Windows API, Java, and JavaScript.
3.1 Encoding Rules
UTF-16 uses 2 or 4 bytes:
- BMP characters (U+0000 - U+FFFF): Directly represented in 2 bytes
- Supplementary characters (U+10000 - U+10FFFF): Use surrogate pairs, taking 4 bytes
Surrogate pair mechanism: subtract 0x10000 from the code point to get a 20-bit value, then split it in two:
- High surrogate (U+D800 - U+DBFF): 0xD800 plus the high 10 bits
- Low surrogate (U+DC00 - U+DFFF): 0xDC00 plus the low 10 bits
For example, 😊 (U+1F60A) has the UTF-16 surrogate pair: D83D DE0A
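The surrogate arithmetic can be checked with a few lines of code. This is a sketch for illustration; `to_surrogates` is a made-up name, and the built-in `utf-16-be` codec is used only to confirm the result.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    value = cp - 0x10000               # 20-bit value
    high = 0xD800 + (value >> 10)      # top 10 bits
    low = 0xDC00 + (value & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F60A)
print(f"{high:04X} {low:04X}")  # D83D DE0A

# Cross-check against Python's own UTF-16 encoder.
assert "😊".encode("utf-16-be") == bytes.fromhex("D83DDE0A")
```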
3.2 UTF-16 BOM
UTF-16 has two byte order variants:
- UTF-16 BE (Big Endian): High byte first, BOM is FE FF
- UTF-16 LE (Little Endian): Low byte first, BOM is FF FE
This is a drawback of UTF-16 — an additional BOM is needed to identify byte order, adding complexity.
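To see the two byte orders and the role of the BOM, the sketch below encodes the same character both ways using Python's standard codecs. It relies on the generic `utf-16` decoder reading the BOM to pick the byte order.

```python
import codecs

text = "A"  # U+0041

be = text.encode("utf-16-be")
le = text.encode("utf-16-le")
print(be.hex(" "))  # 00 41
print(le.hex(" "))  # 41 00

# The BOM constants match the byte sequences FE FF and FF FE.
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"

# With a BOM in front, the generic "utf-16" decoder detects the byte
# order and strips the BOM from the result.
assert (codecs.BOM_UTF16_BE + be).decode("utf-16") == "A"
assert (codecs.BOM_UTF16_LE + le).decode("utf-16") == "A"

# Decoding with the wrong explicit byte order yields a different character.
print(hex(ord(be.decode("utf-16-le"))))  # 0x4100
```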
4. UTF-8 vs UTF-16 vs UTF-32
| Feature | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Byte length | 1-4 bytes | 2 or 4 bytes | Fixed 4 bytes |
| ASCII efficiency | Best (1 byte) | Poor (2 bytes) | Worst (4 bytes) |
| CJK efficiency | Average (3 bytes) | Good (2 bytes) | Poor (4 bytes) |
| Random access | O(n) | Close to O(1)* | O(1) |
| Memory usage | Smallest | Medium | Largest |
| Byte order issues | None | Yes | Yes |
| Primary use | Web, file storage | Windows, Java, JS | Internal processing |
*UTF-16 allows O(1) random access for BMP characters, but surrogate pairs require special handling.
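The size differences in the table are easy to measure. The sketch below counts encoded bytes for a few sample strings; the explicit little-endian codecs are used so that no BOM is added to the counts, and `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("你好"))    # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
print(encoded_sizes("😊"))     # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```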
5. How Emoji Encoding Works
Emoji are part of the Unicode standard, and their encoding involves several special mechanisms:
5.1 Basic Emoji
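Most common Emoji are in the Supplementary Multilingual Plane (SMP) and take 4 bytes in UTF-8. Some older symbols live in the BMP and are followed by the variation selector U+FE0F to request emoji presentation:
- 😀 (U+1F600) → UTF-8: F0 9F 98 80
- ❤️ (U+2764 U+FE0F) → UTF-8: E2 9D A4 EF B8 8F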
5.2 Zero Width Joiner (ZWJ)
ZWJ (U+200D) combines multiple Emoji to create new meanings:
- 👨 (U+1F468) + ZWJ + 💻 (U+1F4BB) = 👨‍💻 (man technologist)
- 🏳️ (U+1F3F3 U+FE0F) + ZWJ + 🌈 (U+1F308) = 🏳️‍🌈 (rainbow flag)
These combinations are multiple code points in text processing but render as a single glyph on platforms that support the sequence.
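The gap between "one glyph on screen" and "several code points in memory" shows up as soon as you measure a string. A minimal sketch, building the sequences from escapes so the invisible ZWJ is explicit:

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

man_technologist = "\U0001F468" + ZWJ + "\U0001F4BB"
rainbow_flag = "\U0001F3F3\uFE0F" + ZWJ + "\U0001F308"

# Python's len() counts code points, not rendered glyphs.
print(len(man_technologist))                  # 3
print(len(rainbow_flag))                      # 4

# UTF-8 size: 4 + 3 + 4 bytes, and 4 + 3 + 3 + 4 bytes.
print(len(man_technologist.encode("utf-8")))  # 11
print(len(rainbow_flag.encode("utf-8")))      # 14

print([f"U+{ord(c):04X}" for c in man_technologist])
```

This is why truncating or reversing a string by code point can split an Emoji apart; segmenting by grapheme cluster needs a dedicated library, which is outside this sketch.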
5.3 Skin Tone Modifiers
Emoji skin tones are achieved by appending Fitzpatrick modifiers to the base Emoji:
- 👍 (U+1F44D) → Base (default yellow)
- 👍🏻 (U+1F44D U+1F3FB) → Light skin
- 👍🏿 (U+1F44D U+1F3FF) → Dark skin
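Appending a modifier is plain string concatenation. The sketch below is illustrative: the dictionary keys and the `apply_skin_tone` helper are names chosen for this example, and it does not check whether the base Emoji actually accepts a modifier.

```python
# The five Fitzpatrick modifiers occupy U+1F3FB through U+1F3FF.
FITZPATRICK = {
    "light": "\U0001F3FB",
    "medium-light": "\U0001F3FC",
    "medium": "\U0001F3FD",
    "medium-dark": "\U0001F3FE",
    "dark": "\U0001F3FF",
}


def apply_skin_tone(base: str, tone: str) -> str:
    """Append the modifier for the given tone to a base Emoji."""
    return base + FITZPATRICK[tone]


thumbs_up = "\U0001F44D"
light = apply_skin_tone(thumbs_up, "light")
dark = apply_skin_tone(thumbs_up, "dark")

print(light, [f"U+{ord(c):04X}" for c in light])
print(dark, [f"U+{ord(c):04X}" for c in dark])
print(len(light))  # 2 code points, rendered as one glyph where supported
```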
5.4 Regional Indicators
Flag Emoji are composed of two regional indicator letters:
- 🇨🇳 = 🇨 (U+1F1E8) + 🇳 (U+1F1F3) → China
- 🇺🇸 = 🇺 (U+1F1FA) + 🇸 (U+1F1F8) → United States
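Since the regional indicator letters run from U+1F1E6 (A) to U+1F1FF (Z), a two-letter country code maps to a flag by simple offset arithmetic. A sketch, with `flag` as a hypothetical helper; whether a given pair renders as a flag depends on the platform's font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Convert a two-letter country code such as "CN" into a flag sequence."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


cn = flag("CN")
us = flag("us")
print(cn, [f"U+{ord(c):04X}" for c in cn])  # U+1F1E8, U+1F1F3
print(us, [f"U+{ord(c):04X}" for c in us])  # U+1F1FA, U+1F1F8
```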
6. Common Encoding Issues & Solutions
6.1 Mojibake (Garbled Text)
Cause: Text decoded with the wrong encoding. For example, UTF-8 Chinese text opened as GBK:
UTF-8 "你好" → E4 BD A0 E5 A5 BD
Decoded as GBK → "浣犲ソ" (mojibake)
Solution: Confirm the original encoding and reopen the file with it. Many modern editors (VS Code, Sublime Text) can guess the encoding or let you reopen a file with a specific one.
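The mojibake case can be reproduced, and in this instance undone, in a few lines. This is a sketch of one specific situation (UTF-8 bytes misread as GBK); the reverse step only works when the wrong decoding happened to succeed without losing bytes, which is not guaranteed for arbitrary text.

```python
original = "你好"
raw = original.encode("utf-8")
print(raw.hex(" ").upper())    # E4 BD A0 E5 A5 BD

# Wrong decoder: the same bytes interpreted as GBK.
garbled = raw.decode("gbk")
print(garbled)                 # 浣犲ソ

# Repair: undo the wrong decoding, then decode with the right encoding.
repaired = garbled.encode("gbk").decode("utf-8")
print(repaired)                # 你好
```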
6.2 Question Marks / Squares
Cause: The target encoding doesn't support the character. For example, GBK can't represent some rare characters or special symbols.
Solution: Use UTF-8 encoding, which can represent all Unicode characters.
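The question marks typically come from an encoder configured to substitute a placeholder when a character has no representation in the target encoding. A minimal sketch using Python's `errors="replace"` option with GBK as the target:

```python
text = "你好😊"

# Strict encoding fails because the Emoji has no GBK representation.
try:
    text.encode("gbk")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# With errors="replace" the unsupported character silently becomes "?".
lossy = text.encode("gbk", errors="replace")
print(lossy.decode("gbk"))  # 你好?

# UTF-8 can represent every character, so the round trip is lossless.
assert text.encode("utf-8").decode("utf-8") == text
```

Once the replacement has happened the original character is gone, so the fix has to be applied where the text is written, not where it is read.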
6.3 BOM-Related Issues
Symptoms: stray "ï»¿" characters (the UTF-8 BOM bytes EF BB BF) at the start of PHP files, JSON parsing failures, CSV misalignment in Excel.
Solution: Save files as UTF-8 without BOM. Select "UTF-8 without BOM" in your editor.
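When you cannot control how a file was saved, you can also strip the BOM while reading. The sketch below uses Python's `utf-8-sig` codec, which drops a leading BOM if one is present; the JSON payload is a made-up example.

```python
import codecs
import json

# Simulated file contents: a UTF-8 BOM followed by JSON.
raw = codecs.BOM_UTF8 + b'{"ok": true}'

# Decoding as plain UTF-8 keeps the BOM as U+FEFF, which the parser rejects.
with_bom = raw.decode("utf-8")
print(repr(with_bom[0]))  # '\ufeff'
try:
    json.loads(with_bom)
except json.JSONDecodeError as exc:
    print("parse failed:", exc.msg)

# "utf-8-sig" removes the BOM when present and is harmless when it is not.
clean = raw.decode("utf-8-sig")
print(json.loads(clean))  # {'ok': True}
```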
6.4 MySQL Encoding Issues
Charset mismatch between database and tables is a common issue. Ensure the entire chain uses the same encoding:
    -- Set connection charset
    SET NAMES utf8mb4;

    -- Create database
    CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- Create table
    CREATE TABLE articles (
        content TEXT CHARACTER SET utf8mb4
    );
Note: Use utf8mb4 instead of utf8. MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character and cannot hold 4-byte characters like Emoji.
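On the application side, it can help to detect text that would not fit in a 3-byte column before it reaches the database. The check below is a sketch: `needs_utf8mb4` is a hypothetical helper, and it relies only on the fact that characters above U+FFFF take 4 bytes in UTF-8.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if text contains any character outside the BMP."""
    # Code points above U+FFFF encode to 4 bytes in UTF-8.
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("plain ASCII"))  # False
print(needs_utf8mb4("你好"))          # False (3 bytes per character)
print(needs_utf8mb4("hi 😊"))        # True  (Emoji is 4 bytes)
```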
6.5 URL Encoding
URLs can only use a subset of ASCII characters. Non-ASCII characters require percent encoding:
- "你好" → %E4%BD%A0%E5%A5%BD
- "中文.com" → %E4%B8%AD%E6%96%87.com
Modern browsers display Unicode characters directly in the address bar, but what goes over the wire is still ASCII: internationalized domain names (IDN) are converted to Punycode, while paths and query strings are percent-encoded.
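Both mechanisms are available in Python's standard library. The sketch below percent-encodes a path segment with `urllib.parse.quote` and converts a host name with the built-in `idna` codec; note that this codec implements the older IDNA 2003 rules, so treat the host conversion as illustrative.

```python
from urllib.parse import quote, unquote

# Percent-encoding: the text is encoded as UTF-8, then each byte becomes %XX.
encoded = quote("你好")
print(encoded)           # %E4%BD%A0%E5%A5%BD
print(unquote(encoded))  # 你好

# Host names use Punycode instead; each label gets an "xn--" prefix.
host = "中文.com".encode("idna")
print(host)
```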
Summary
Unicode and UTF-8 are the foundation of modern text processing. Understanding encoding principles helps you solve mojibake, database storage, and Emoji handling issues from the root. Remember the key points: prefer UTF-8, be aware of MySQL's utf8 vs utf8mb4 difference, and account for ZWJ and modifier complexity when handling Emoji. The Online Unicode Converter is your go-to tool for encoding problems.