Duplicate data is one of the most common issues in data cleaning. Whether you're working with customer lists, product catalogs, email lists, or log data, duplicate entries lead to statistical bias, repeated messages, and wasted storage. Industry estimates commonly put duplicates at 5%–10% of enterprise database records, and in organizations without proper cleaning workflows the figure can reach 30%.
This guide systematically covers multiple deduplication methods — from the simplest online tools to Excel/Google Sheets operations, all the way to Python scripting — serving readers across different skill levels and use cases.
Before diving in, it's important to understand which type of duplicate you're dealing with.
**Exact duplicates:** two or more rows where every field is identical. This is the simplest type — a single command or button click handles it.
**Partial duplicates:** certain key fields match but others differ. For example, the same customer appears twice in a database with slightly different address details. These require more careful handling — you may need to decide which record to keep.
**Fuzzy duplicates:** data that looks similar but isn't identical, often due to typos or formatting differences (case, spaces, punctuation). For example, "Beijing", "Bei jing", and "beijing" all refer to the same city.
| Type | Example | Difficulty | Recommended Method |
|---|---|---|---|
| Exact Duplicate | "alice@email.com" appears 3 times | ⭐ Easy | Online Tool / Excel |
| Partial Duplicate | Same ID, different addresses | ⭐⭐ Medium | Multi-column dedup |
| Fuzzy Duplicate | "iPhone 15" vs "iphone15" | ⭐⭐⭐ Hard | Normalization + fuzzy matching |
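Resolving partial duplicates usually means picking a winner per key. A minimal pandas sketch — the `customer_id` and `updated_at` column names are illustrative assumptions — that keeps the most recent record for each customer:

```python
import pandas as pd

# Hypothetical customer records: same customer_id, different addresses
df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "address": ["1 Old St", "2 New Ave", "9 Elm Rd"],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-03-15", "2024-02-10"]),
})

# Sort so the newest record comes last, then keep the last row per key
deduped = (df.sort_values("updated_at")
             .drop_duplicates(subset="customer_id", keep="last")
             .sort_index())

print(deduped)  # customer 101 keeps "2 New Ave", its most recent address
```

The sort-then-`keep="last"` pattern generalizes to any "best record wins" rule: sort by whatever makes a record preferable, then drop duplicates on the key.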
For deduplicating plain text lists (keyword lists, URL lists, email lists), online tools are the fastest option. The Risetop Remove Duplicates Tool offers a zero-barrier experience.
The advantage of online tools is that no software installation or syntax knowledge is required — just open your browser. They're ideal for quick, one-off cleanups of keyword, URL, and email lists.
This is the most straightforward approach, suitable for most scenarios.
Select your data range, go to Data → Remove Duplicates, choose which columns to compare, and click OK — Excel will report how many duplicates were removed.
If you'd rather highlight duplicates for manual review instead of deleting them:
Home → Conditional Formatting → Highlight Cells Rules → Duplicate Values

Modern Excel offers the UNIQUE function, which extracts unique values without modifying the original data:
```
=UNIQUE(A2:A100)               // extract unique values
=UNIQUE(A2:A100, FALSE, TRUE)  // extract values that appear exactly once
```
For more flexible control, use the COUNTIF function to create a helper column:
```
=COUNTIF(A$2:A$100, A2)   // count occurrences within the range
                          // result > 1 means the value is a duplicate
```
Go to Data → Advanced → check Unique records only to copy unique values to a new location.
Similar to Excel: Data → Remove duplicates → select columns → confirm. The workflow is nearly identical.
Google Sheets also supports the UNIQUE function — and unlike Excel, it's available in every version:
```
=UNIQUE(A2:A)   // extract all unique values from column A
=UNIQUE(A2:C)   // multi-column deduplication
```
Leverage Google Sheets' powerful QUERY function:
```
=QUERY(A2:A, "SELECT A WHERE A != '' GROUP BY A", 0)
```
For scenarios that need regular automated dedup, you can write a script:
```javascript
function removeDuplicates() {
  // Operate on the active sheet (not the whole spreadsheet)
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  var headers = data[0];
  var uniqueData = [headers];
  var seen = new Set();
  for (var i = 1; i < data.length; i++) {
    // Use the entire row as the dedup key
    var key = data[i].join('|');
    if (!seen.has(key)) {
      seen.add(key);
      uniqueData.push(data[i]);
    }
  }
  sheet.clearContents();
  sheet.getRange(1, 1, uniqueData.length, uniqueData[0].length)
       .setValues(uniqueData);
}
```
For large-scale data or custom dedup logic, Python is the most flexible option.
```python
# List dedup (preserves order)
def remove_duplicates(lst):
    seen = set()
    # set.add returns None, so the "or" records each value while keeping first occurrences
    return [x for x in lst if not (x in seen or seen.add(x))]

data = ["apple", "banana", "apple", "cherry", "banana"]
result = remove_duplicates(data)
# ['apple', 'banana', 'cherry']
```
```python
def case_insensitive_dedup(lst):
    seen = set()
    result = []
    for item in lst:
        key = item.strip().lower()   # normalize before comparing
        if key and key not in seen:
            seen.add(key)
            result.append(item)      # keep the original spelling
    return result
```
```python
import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')

# Exact duplicate removal
df.drop_duplicates(inplace=True)

# Dedup based on specific columns
df.drop_duplicates(subset=['email'], inplace=True)

# Keep the last occurrence (the default keeps the first)
df.drop_duplicates(subset=['email'], keep='last', inplace=True)

# Export results
df.to_csv('cleaned_data.csv', index=False)
```
```python
from fuzzywuzzy import fuzz, process  # maintained as "thefuzz" in newer releases

# Calculate string similarity (0-100)
score = fuzz.ratio("iPhone 15", "iphone15")  # 82 — close, but not identical

# Find the closest match from a list
choices = ["iPhone 15 Pro", "iPhone 15", "iPhone 14"]
result = process.extractOne("iphone 15", choices)
# ('iPhone 15', 100)
```
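Similarity scores alone don't dedup anything — you still need a pass that drops items too close to one already kept. A sketch using only the standard library's difflib, so nothing extra to install (the 0.9 threshold is an arbitrary assumption, not a library default):

```python
from difflib import SequenceMatcher

def _norm(s):
    # Normalize before comparing: trim, lowercase, drop spaces
    return s.strip().lower().replace(" ", "")

def fuzzy_dedup(items, threshold=0.9):
    """Keep an item only if no already-kept item is >= threshold similar."""
    kept = []
    for item in items:
        if not any(SequenceMatcher(None, _norm(item), _norm(k)).ratio() >= threshold
                   for k in kept):
            kept.append(item)
    return kept

products = ["iPhone 15", "iphone15", "iPhone 15 Pro", "iPhone 14"]
print(fuzzy_dedup(products))
# ['iPhone 15', 'iPhone 15 Pro', 'iPhone 14']
```

Note this is O(n²) — every candidate is compared against everything kept so far — so it suits lists of thousands, not millions, of items.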
Many records that "don't look like duplicates" actually are after normalization. Standardizing before dedup significantly improves results.
| Operation | Before | After |
|---|---|---|
| Trim whitespace | " apple " | "apple" |
| Normalize case | "Apple", "APPLE" | "apple" |
| Remove extra spaces | "apple  banana" | "apple banana" |
| Standardize punctuation | "U.S.A.", "USA" | "USA" |
| Standardize format | "2024/01/15", "01-15-2024" | "2024-01-15" |
| Remove special characters | "+86-138-0000-1234" | "8613800001234" |
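The table's operations chain naturally into a single normalization function. A minimal sketch covering the text rules (date and phone handling omitted; the exact regexes are illustrative assumptions):

```python
import re

def normalize(text):
    """Apply the basic normalizations from the table above."""
    text = text.strip()                  # trim surrounding whitespace
    text = text.lower()                  # normalize case
    text = re.sub(r"\s+", " ", text)     # collapse runs of spaces
    text = re.sub(r"[^\w\s]", "", text)  # strip punctuation: "u.s.a." -> "usa"
    return text

def normalized_dedup(items):
    # Compare normalized keys, but keep each item's original form
    seen, result = set(), []
    for item in items:
        key = normalize(item)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result

print(normalized_dedup(["  Apple ", "APPLE", "U.S.A.", "USA"]))
# ['  Apple ', 'U.S.A.']
```

Deduping on the normalized key while appending the original item means the output preserves the first spelling seen, which is usually what you want in a cleaned list.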
The Risetop Dedup Tool has built-in options for case-insensitive matching and whitespace trimming, handling normalization and dedup in one step.
Dedup operations are irreversible (especially when deleting directly), so always back up your original data before proceeding.
**Which occurrence does Excel keep?** When using the "Remove Duplicates" feature, Excel keeps the first occurrence and deletes subsequent ones — row order is otherwise unchanged. To keep the last occurrence instead, sort by timestamp in descending order before deduping. The UNIQUE function behaves the same way: it returns unique values in order of first appearance.
**What about very large files?** Excel slows down or crashes near its hard limit of 1,048,576 rows. Use Python pandas (with `chunksize` for streaming) or a database (`SELECT DISTINCT` / `ROW_NUMBER()`). For plain text files, the command line is extremely fast: `sort file.txt | uniq > deduped.txt`.
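The pandas `chunksize` approach mentioned above can be sketched like this — stream the file in chunks and track keys seen so far, so memory stays bounded by the number of unique keys rather than the number of rows (the `email` key column and file names are assumptions):

```python
import pandas as pd

def dedup_large_csv(src, dst, key="email", chunksize=100_000):
    """Stream src in chunks, writing only the first row seen per key to dst."""
    seen = set()
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        # Drop rows whose key appeared in an earlier chunk
        chunk = chunk[~chunk[key].isin(seen)]
        # Also drop duplicates within this chunk itself
        chunk = chunk.drop_duplicates(subset=key, keep="first")
        seen.update(chunk[key])
        # Write the header only on the first chunk, then append
        chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
```

If even the key set won't fit in memory, fall back to a database or to `sort -u` on an exported key column.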
**Is my data safe in an online tool?** It depends on the implementation. Risetop's dedup tool runs entirely in the browser (using a JavaScript Set/Map) — your data is never sent to any server, which you can verify by using it offline. For sensitive data (like customer info), local tools or programmatic approaches are always the safer choice.
Text deduplication is one of the most fundamental and important operations in data cleaning. Mastering multiple methods lets you handle different work scenarios with ease. For simple list dedup, online tools are most convenient; for structured tabular data, Excel/Sheets built-in features work well; for large-scale or complex needs, Python is the most powerful choice.