Unicode Converter Guide

Understand how character encoding works and solve mojibake, Emoji, and cross-platform text issues

Whether you're a developer handling multilingual text or an everyday user dealing with garbled characters, understanding Unicode encoding is the key to solving the problem. The Unicode Converter Tool helps you quickly convert between encodings, but mastering the underlying principles lets you tackle any situation.

1. Unicode Fundamentals

1.1 What is Unicode

Unicode is a character encoding standard designed to cover all the world's writing systems. It assigns a unique number (called a "code point") to each character, written as U+XXXX, where XXXX is four to six hexadecimal digits.

As of Unicode 15.1, over 149,000 characters have been defined, covering 150+ scripts, thousands of symbols, and Emoji. Code points range from U+0000 to U+10FFFF, for a total of 1,114,112 possible code points.
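To make the U+XXXX notation concrete, here is a minimal Python sketch. The helper name code_point_label is hypothetical, chosen for illustration; it relies only on the built-in ord().

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))    # U+0041
print(code_point_label("中"))   # U+4E2D
print(code_point_label("😊"))   # U+1F60A

# U+0000 through U+10FFFF inclusive gives the total size of the code space.
print(0x10FFFF + 1)             # 1114112
```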

1.2 Unicode vs Character Set vs Encoding

1.3 Code Planes

Unicode divides the code point space into 17 planes, each with 65,536 code points:

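Plane | Range | Name | Contents
0 | U+0000 - U+FFFF | BMP (Basic Multilingual Plane) | Common characters: Latin, CJK, symbols
1 | U+10000 - U+1FFFF | SMP (Supplementary Multilingual Plane) | Emoji, historic scripts, music symbols
2 | U+20000 - U+2FFFF | SIP (Supplementary Ideographic Plane) | Extended CJK Unified Ideographs
3 | U+30000 - U+3FFFF | TIP (Tertiary Ideographic Plane) | Further CJK extensions (G and H)
4-13 | U+40000 - U+DFFFF | Unassigned | Future expansion
14 | U+E0000 - U+EFFFF | SSP (Supplementary Special-purpose Plane) | Tag characters, variation selectors
15-16 | U+F0000 - U+10FFFF | Private Use | User-defined characters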
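Because each plane holds exactly 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. A short Python sketch, with plane_of as a hypothetical helper name:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character (hypothetical helper)."""
    # Each plane spans 0x10000 code points, so the plane is the value above bit 16.
    return ord(ch) >> 16


print(plane_of("A"))            # 0  (BMP)
print(plane_of("😊"))           # 1  (SMP)
print(plane_of("\U00020000"))   # 2  (first code point of plane 2)
print(plane_of("\U0010FFFF"))   # 16 (last code point overall)
```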

2. UTF-8 Encoding in Detail

UTF-8 is the dominant encoding on the internet today, used by over 98% of web pages worldwide.

2.1 Encoding Rules

UTF-8 uses a variable-length 1-4 byte encoding:

Unicode Range | UTF-8 Byte Pattern | Example
U+0000 - U+007F | 0xxxxxxx (1 byte) | A → 41
U+0080 - U+07FF | 110xxxxx 10xxxxxx (2 bytes) | é → C3 A9
U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx (3 bytes) | 中 → E4 B8 AD
U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes) | 😊 → F0 9F 98 8A
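The byte patterns in the table can be sketched by hand in Python and checked against the built-in encoder. The function utf8_encode is a hypothetical illustration, not a replacement for str.encode; as a simplification it does not reject surrogate code points (U+D800 - U+DFFF), which a real encoder must.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by following the bit patterns (sketch)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "Aé中😊":
    manual = utf8_encode(ord(ch))
    # The hand-rolled result should match Python's own encoder.
    assert manual == ch.encode("utf-8")
    print(ch, manual.hex(" ").upper())
```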

2.2 Advantages of UTF-8

💡 Pro tip: Use the Unicode Converter Tool to convert any text to UTF-8 hex, UTF-16, Unicode escape sequences, and more — handy for development and debugging.

3. UTF-16 Encoding in Detail

UTF-16 is another common Unicode encoding, used internally by the Windows API, Java, and JavaScript.

3.1 Encoding Rules

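UTF-16 uses 2 or 4 bytes per character. Code points in the BMP (U+0000 - U+FFFF) are stored as a single 16-bit code unit (2 bytes); code points above U+FFFF are stored as two 16-bit code units (4 bytes), known as a surrogate pair.

Surrogate pair mechanism: subtract 0x10000 from the code point to get a 20-bit value, add its upper 10 bits to 0xD800 to form the high surrogate (D800 - DBFF), and add its lower 10 bits to 0xDC00 to form the low surrogate (DC00 - DFFF).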

For example, 😊 (U+1F60A) has the UTF-16 surrogate pair: D83D DE0A
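The surrogate pair for 😊 can be reproduced with a few lines of Python. This is a sketch under the standard UTF-16 rules; to_surrogates is a hypothetical helper name.

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates (sketch)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # lower 10 bits
    return high, low


high, low = to_surrogates(0x1F60A)
print(f"{high:04X} {low:04X}")            # D83D DE0A

# Cross-check against Python's built-in big-endian UTF-16 encoder.
assert "😊".encode("utf-16-be").hex().upper() == f"{high:04X}{low:04X}"
```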

3.2 UTF-16 BOM

UTF-16 has two byte order variants: big-endian (UTF-16BE, BOM bytes FE FF) and little-endian (UTF-16LE, BOM bytes FF FE).

This is a drawback of UTF-16 — an additional BOM is needed to identify byte order, adding complexity.
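A small Python sketch shows how the same character is laid out under each byte order, and how a decoder can use the BOM to choose between them. It uses only the standard codecs module; the variable names are illustrative.

```python
import codecs

text = "A"                                # U+0041
be = text.encode("utf-16-be")             # big-endian, no BOM written
le = text.encode("utf-16-le")             # little-endian, no BOM written
print(be.hex(" "))                        # 00 41
print(le.hex(" "))                        # 41 00

print(codecs.BOM_UTF16_BE.hex(" "))       # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))       # ff fe

# The generic "utf-16" decoder reads the BOM, picks the byte order, and strips it.
decoded_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
decoded_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
print(decoded_be == decoded_le == text)   # True
```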

4. UTF-8 vs UTF-16 vs UTF-32

Feature | UTF-8 | UTF-16 | UTF-32
Byte length | 1-4 bytes | 2 or 4 bytes | Fixed 4 bytes
ASCII efficiency | Best (1 byte) | Poor (2 bytes) | Worst (4 bytes)
CJK efficiency | Average (3 bytes) | Good (2 bytes) | Poor (4 bytes)
Random access | O(n) | Close to O(1)* | O(1)
Memory usage | Smallest | Medium | Largest
Byte order issues | None | Yes | Yes
Primary use | Web, file storage | Windows, Java, JS | Internal processing

*UTF-16 allows O(1) random access for BMP characters, but surrogate pairs require special handling.
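The size differences in the table can be measured directly. The sketch below assumes Python 3 and uses the explicit little-endian codecs so that no BOM is counted; encoded_sizes is a hypothetical helper name.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in each encoding, without a BOM (sketch)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("Hello"))   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("你好"))     # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
print(encoded_sizes("😊"))      # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Note that for a single supplementary character such as 😊, all three encodings need 4 bytes.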

5. How Emoji Encoding Works

Emoji are part of the Unicode standard, and their encoding involves several special mechanisms:

5.1 Basic Emoji

Most common Emoji are in the Supplementary Multilingual Plane (SMP) and use 4-byte UTF-8 encoding. For example, 😊 (U+1F60A) encodes as F0 9F 98 8A.

5.2 Zero Width Joiner (ZWJ)

ZWJ (U+200D) combines multiple Emoji to create new meanings. For example, 👨 (U+1F468) + ZWJ + 👩 (U+1F469) + ZWJ + 👧 (U+1F467) renders as a single family Emoji.

These combinations are multiple code points in text processing but render as a single glyph.
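The gap between "one glyph on screen" and "several code points in memory" is easy to observe. The Python sketch below builds a family sequence from its parts; the variable names are illustrative, and whether the result renders as a single glyph depends on the font and platform.

```python
ZWJ = "\u200D"                      # ZERO WIDTH JOINER

# man + ZWJ + woman + ZWJ + girl
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

print(len(family))                              # 5 code points
print([f"U+{ord(c):04X}" for c in family])
print(len(family.encode("utf-8")))              # 18 bytes (4 + 3 + 4 + 3 + 4)
```

In practice this means that len(), slicing, and truncation operate on code points, so cutting a string in the middle of such a sequence splits the Emoji into its components.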

5.3 Skin Tone Modifiers

Emoji skin tones are achieved by appending a Fitzpatrick modifier (U+1F3FB - U+1F3FF) to the base Emoji. For example, 👍 (U+1F44D) followed by U+1F3FD renders as 👍🏽.
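A brief Python sketch of a modifier sequence, assuming a Python build whose Unicode database includes the modifier characters (any recent Python 3 does). The constant names are illustrative.

```python
import unicodedata

THUMBS_UP = "\U0001F44D"
MEDIUM_SKIN_TONE = "\U0001F3FD"     # one of the five modifiers U+1F3FB - U+1F3FF

toned = THUMBS_UP + MEDIUM_SKIN_TONE

print(len(toned))                               # 2 code points, one visible glyph
print(unicodedata.name(MEDIUM_SKIN_TONE))       # official character name
print(toned.encode("utf-8").hex(" ").upper())   # 8 bytes in UTF-8
```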

5.4 Regional Indicators

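Flag Emoji are composed of two regional indicator symbols (U+1F1E6 - U+1F1FF) that spell a two-letter country code. For example, U+1F1FA (U) followed by U+1F1F8 (S) renders as 🇺🇸.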
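The mapping from a two-letter country code to a flag sequence is a fixed offset from "A". A minimal Python sketch, with flag as a hypothetical helper name; it does not check that the code is an assigned country, so unassigned pairs simply render as two letter symbols.

```python
REGIONAL_INDICATOR_A = 0x1F1E6      # regional indicator symbol letter A


def flag(country_code: str) -> str:
    """Build a flag sequence from a two-letter ASCII country code (sketch)."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError("expected two ASCII letters, e.g. 'US'")
    # Shift each letter A-Z into the regional indicator range.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


us = flag("US")
print([f"U+{ord(c):04X}" for c in us])   # ['U+1F1FA', 'U+1F1F8']
print(len(us))                           # 2 code points
```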

6. Common Encoding Issues & Solutions

6.1 Mojibake (Garbled Text)

Cause: Text decoded with the wrong encoding. For example, UTF-8 Chinese text opened with GBK:

UTF-8 "你好" → E4 BD A0 E5 A5 BD
Decoded as GBK → "浣犲ソ" (mojibake)

Solution: Confirm the original encoding and open with the correct one. Most modern editors (VS Code, Sublime Text) auto-detect encoding.
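The mistake and its repair can be reproduced in Python. This sketch assumes the wrong decode was lossless (no bytes were dropped or replaced along the way); in that case, re-encoding with the wrong codec and decoding with the right one restores the text.

```python
original = "你好"
raw = original.encode("utf-8")
print(raw.hex(" ").upper())          # E4 BD A0 E5 A5 BD

# Wrong: interpret the UTF-8 bytes as GBK. This produces mojibake.
garbled = raw.decode("gbk")
print(garbled)

# Repair: undo the wrong decode, then decode with the correct encoding.
repaired = garbled.encode("gbk").decode("utf-8")
print(repaired == original)          # True
```

If the text was saved after being garbled and some bytes were replaced with "?" or U+FFFD, this round trip no longer works and the original has to be recovered from another copy.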

6.2 Question Marks / Squares

Cause: The target encoding doesn't support the character. For example, GBK can't represent some rare characters or special symbols.

Solution: Use UTF-8 encoding, which can represent all Unicode characters.
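Whether a target encoding can represent a string can be tested before writing it out. A hedged Python sketch, with can_encode as a hypothetical helper name:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in encoding (sketch)."""
    try:
        text.encode(encoding)        # strict mode raises on unsupported characters
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("你好", "gbk"))      # True
print(can_encode("😊", "gbk"))       # False
print(can_encode("😊", "utf-8"))     # True

# With errors="replace", unsupported characters silently become "?".
print("Hi 😊".encode("gbk", errors="replace"))   # b'Hi ?'
```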

6.3 BOM-Related Issues

Symptoms: a stray "ï»¿" (the UTF-8 BOM bytes EF BB BF displayed as Latin-1) at the start of PHP files, JSON parsing failures, CSV misalignment in Excel.

Solution: Save files as UTF-8 without BOM. Select "UTF-8 without BOM" in your editor.
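When the file cannot be re-saved, the BOM can be stripped at read time instead. The Python sketch below uses the standard "utf-8-sig" codec, which drops a leading BOM if one is present; the sample payload is made up for illustration.

```python
import codecs
import json

payload = codecs.BOM_UTF8 + b'{"ok": true}'      # hypothetical file contents

as_latin1 = payload.decode("latin-1")            # shows the BOM as three characters
with_bom = payload.decode("utf-8")               # keeps U+FEFF as the first character
without_bom = payload.decode("utf-8-sig")        # strips a leading BOM if present

try:
    json.loads(with_bom)
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True                          # the JSON parser refuses the BOM

parsed = json.loads(without_bom)
print(as_latin1[:3], bom_rejected, parsed)
```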

6.4 MySQL Encoding Issues

Charset mismatch between database and tables is a common issue. Ensure the entire chain uses the same encoding:

-- Set connection charset
SET NAMES utf8mb4;

-- Create database
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Create table
CREATE TABLE articles (
  content TEXT CHARACTER SET utf8mb4
);

Note: Use utf8mb4 instead of utf8 — MySQL's utf8 only supports up to 3 bytes and cannot store 4-byte characters like Emoji.

6.5 URL Encoding

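URLs can only use a subset of ASCII characters. Non-ASCII characters require percent encoding: each byte of the character's UTF-8 encoding is written as % followed by two hex digits. For example, 中 becomes %E4%B8%AD.

Modern browsers display Unicode characters directly in the address bar, but the underlying transmission still uses ASCII: paths and query strings are percent-encoded, and internationalized domain names (IDN) are converted to Punycode.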
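Both mechanisms are available in Python's standard library. The sketch below uses urllib.parse for percent encoding and the built-in "idna" codec for host names; note that this codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a complete IDN solution.

```python
from urllib.parse import quote, unquote

# Percent encoding: each UTF-8 byte becomes %XX.
encoded = quote("中")
print(encoded)                       # %E4%B8%AD
print(unquote(encoded))              # 中

# IDN: a non-ASCII host name is converted to an ASCII "xn--" form.
host = "münchen.de".encode("idna")
print(host)                          # b'xn--mnchen-3ya.de'
```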

Summary

Unicode and UTF-8 are the foundation of modern text processing. Understanding encoding principles helps you solve mojibake, database storage, and Emoji handling issues from the root. Remember the key points: prefer UTF-8, be aware of MySQL's utf8 vs utf8mb4 difference, and account for ZWJ and modifier complexity when handling Emoji. The Online Unicode Converter is your go-to tool for encoding problems.