Remove Duplicates Guide: Excel/Sheets Methods & Data Cleaning Tips

📅 April 10, 2026 📖 ~10 min read 🏷️ Data Processing · Productivity Tools

Duplicate data is one of the most common issues in data cleaning. Whether you're working with customer lists, product catalogs, email lists, or log data, duplicate entries lead to statistical bias, repeated messages, and wasted storage. Industry estimates suggest that 5–10% of enterprise database records are duplicates, and in organizations without proper cleaning workflows the figure can reach 30%.

This guide systematically covers multiple deduplication methods — from the simplest online tools to Excel/Google Sheets operations, all the way to Python scripting — serving readers across different skill levels and use cases.

1. Understanding Types of Duplicates

Before diving in, it's important to understand which type of duplicate you're dealing with.

Exact Duplicates

Two or more rows where every field is identical. This is the simplest type — one command or button click handles it.

Partial Duplicates

Certain key fields match but others differ. For example, the same customer appears twice in a database with slightly different address details. These require more careful handling — you may need to decide which record to keep.

Fuzzy Duplicates

Data looks similar but isn't identical, often due to typos or formatting differences (case, spaces, punctuation). For example, "Beijing", "Bei jing", and "beijing" all refer to the same city.

| Type | Example | Difficulty | Recommended Method |
| --- | --- | --- | --- |
| Exact Duplicate | "alice@email.com" appears 3 times | ⭐ Easy | Online Tool / Excel |
| Partial Duplicate | Same ID, different addresses | ⭐⭐ Medium | Multi-column dedup |
| Fuzzy Duplicate | "iPhone 15" vs "iphone15" | ⭐⭐⭐ Hard | Normalization + fuzzy matching |

2. Online Text Dedup Tools

For deduplicating plain text lists (keyword lists, URL lists, email lists), online tools are the fastest option. The Risetop Remove Duplicates Tool offers a zero-barrier experience:

Steps:
1. Paste your text into the input box (one entry per line)
2. Choose options: ignore case, trim whitespace, sort results
3. Click the "Remove Duplicates" button
4. Copy the deduplicated results
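Under the hood, those options map to only a few lines of code. A simplified Python sketch of the same logic (an illustration, not Risetop's actual implementation):

```python
def dedup_lines(text, ignore_case=False, trim=False, sort_result=False):
    """Remove duplicate lines from text, mirroring typical online-tool options."""
    seen = set()
    result = []
    for line in text.splitlines():
        if trim:
            line = line.strip()                      # trim whitespace option
        key = line.lower() if ignore_case else line  # ignore-case option
        if key not in seen:
            seen.add(key)
            result.append(line)
    if sort_result:
        result.sort()                                # sort-results option
    return "\n".join(result)

print(dedup_lines("Apple\napple\nBanana", ignore_case=True))
# keeps the first spelling of each entry: Apple, Banana
```

Note that the first occurrence wins: with `ignore_case=True`, "Apple" is kept and the later "apple" is dropped.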

The advantage of online tools is that they require no software installation or syntax knowledge: just open your browser. They're ideal for quick, one-off cleanups of keyword lists, URL lists, and email lists.

3. Excel Deduplication Methods

Method 1: Built-in "Remove Duplicates" Feature

This is the most straightforward approach, suitable for most scenarios.

Steps:
1. Select the data range (including headers)
2. Click Data → Remove Duplicates
3. In the dialog, select the columns to check for duplicates
4. Click OK — Excel will report how many duplicates were removed

Method 2: Conditional Formatting to Highlight Duplicates

If you'd rather highlight duplicates for manual review instead of deleting them:

Steps:
1. Select the data range
2. Click Home → Conditional Formatting → Highlight Cells Rules → Duplicate Values
3. Choose a highlight style; matching cells are marked automatically (light red fill by default)

Method 3: UNIQUE Function (Excel 365 / 2021)

Modern Excel offers the UNIQUE function, which extracts unique values without modifying the original data:

=UNIQUE(A2:A100)        // Extract unique values
=UNIQUE(A2:A100, FALSE, TRUE)  // Extract values that appear exactly once

Method 4: COUNTIF Helper Column

For more flexible control, use the COUNTIF function to create a helper column:

=COUNTIF(A$2:A$100, A2)   // Count occurrences within the range
// Result > 1 means it's a duplicate

Method 5: Advanced Filter

Go to Data → Advanced → check Unique records only to copy unique values to a new location.

4. Google Sheets Deduplication Methods

Method 1: Built-in "Remove Duplicates" Feature

Similar to Excel: Data → Remove duplicates → select columns → confirm. The workflow is nearly identical.

Method 2: UNIQUE Function

Google Sheets also supports the UNIQUE function, and it's available in all versions:

=UNIQUE(A2:A)           // Extract all unique values from column A
=UNIQUE(A2:C)           // Multi-column deduplication

Method 3: QUERY Function

Leverage Google Sheets' powerful QUERY function:

=QUERY(A2:A, "SELECT A WHERE A != '' GROUP BY A", 0)

Method 4: Google Apps Script for Batch Dedup

For scenarios that need regular automated dedup, you can write a script:

function removeDuplicates() {
  var sheet = SpreadsheetApp.getActiveSheet();  // getDataRange() lives on Sheet, not Spreadsheet
  var range = sheet.getDataRange();
  var data = range.getValues();
  var headers = data[0];
  var uniqueData = [headers];
  var seen = new Set();
  
  for (var i = 1; i < data.length; i++) {
    var key = data[i].join('|');
    if (!seen.has(key)) {
      seen.add(key);
      uniqueData.push(data[i]);
    }
  }
  
  sheet.clearContents();
  sheet.getRange(1, 1, uniqueData.length, uniqueData[0].length)
       .setValues(uniqueData);
}

5. Python Deduplication

For large-scale data or custom dedup logic, Python is the most flexible option.

Basic Dedup

# List dedup (preserve order)
def remove_duplicates(lst):
    seen = set()
    return [x for x in lst if not (x in seen or seen.add(x))]

data = ["apple", "banana", "apple", "cherry", "banana"]
result = remove_duplicates(data)
# ['apple', 'banana', 'cherry']
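Since Python 3.7, dictionaries preserve insertion order, so the same order-preserving dedup fits in one line:

```python
data = ["apple", "banana", "apple", "cherry", "banana"]
result = list(dict.fromkeys(data))  # dict keys are unique and keep insertion order
print(result)  # ['apple', 'banana', 'cherry']
```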

Case-Insensitive Dedup

def case_insensitive_dedup(lst):
    seen = set()
    result = []
    for item in lst:
        key = item.strip().lower()
        if key and key not in seen:
            seen.add(key)
            result.append(item)
    return result

Using pandas for Tabular Data

import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')

# Exact duplicate removal
df.drop_duplicates(inplace=True)

# Dedup based on specific columns
df.drop_duplicates(subset=['email'], inplace=True)

# Keep last occurrence (default keeps first)
df.drop_duplicates(subset=['email'], keep='last', inplace=True)

# Export results
df.to_csv('cleaned_data.csv', index=False)
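Before deleting anything, it is worth counting and inspecting what drop_duplicates would remove. pandas' duplicated() marks repeat rows (the DataFrame below is synthetic, for illustration only):

```python
import pandas as pd

# Synthetic example data (assumed for illustration)
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "name":  ["Alice", "Bob", "Alice"],
})

print(df.duplicated().sum())                  # exact duplicate rows
print(df.duplicated(subset=["email"]).sum())  # duplicates by email only

# keep=False flags every member of a duplicate group, not just the extras,
# so you can review both the kept and the to-be-removed records together
dupes = df[df.duplicated(subset=["email"], keep=False)]
print(dupes)
```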

Fuzzy Dedup (using fuzzywuzzy)

from fuzzywuzzy import fuzz, process

# Calculate string similarity (0-100)
score = fuzz.ratio("iPhone 15", "iphone15")  # high, but below 100 (case and spacing differ)

# Find the closest match from a list
choices = ["iPhone 15 Pro", "iPhone 15", "iPhone 14"]
result = process.extractOne("iphone 15", choices)
# ('iPhone 15', 100)
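If installing fuzzywuzzy isn't an option, the standard library's difflib computes a similar similarity ratio. A threshold-based fuzzy dedup might look like the sketch below (O(n²) pairwise comparison, fine for small lists; the threshold value is a tunable assumption):

```python
from difflib import SequenceMatcher

def fuzzy_dedup(items, threshold=0.9):
    """Keep an item only if no already-kept item is >= threshold similar."""
    kept = []
    for item in items:
        key = item.strip().lower()  # normalize before comparing
        if not any(SequenceMatcher(None, key, k.strip().lower()).ratio() >= threshold
                   for k in kept):
            kept.append(item)
    return kept

print(fuzzy_dedup(["iPhone 15", "iphone15", "iPhone 14"]))
# "iphone15" is dropped as a near-match of "iPhone 15"; "iPhone 14" survives
```

Set the threshold too low and distinct products ("iPhone 14" vs "iPhone 15") collapse into one; too high and true duplicates slip through. Spot-check results before committing.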

6. Data Normalization Before Dedup

Many records that "don't look like duplicates" actually are after normalization. Standardizing before dedup significantly improves results.

| Operation | Before | After |
| --- | --- | --- |
| Trim whitespace | " apple " | "apple" |
| Normalize case | "Apple", "APPLE" | "apple" |
| Remove extra spaces | "apple  banana" | "apple banana" |
| Standardize punctuation | "U.S.A.", "USA" | "USA" |
| Standardize date format | "2024/01/15", "01-15-2024" | "2024-01-15" |
| Remove special characters | "+86-138-0000-1234" | "13800001234" |

The Risetop Dedup Tool has built-in options for case-insensitive matching and whitespace trimming, handling normalization and dedup in one step.
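In code, the first few operations from the table can be combined into a single key function used for comparison (a minimal sketch; date and phone formats need field-specific rules and are not covered here):

```python
import re

def normalize(value):
    """Build a comparison key: trim, lowercase, collapse spaces, drop periods."""
    value = value.strip()               # trim whitespace
    value = value.lower()               # normalize case
    value = re.sub(r"\s+", " ", value)  # collapse repeated spaces
    value = value.replace(".", "")      # standardize punctuation, e.g. "U.S.A." -> "usa"
    return value

print(normalize("  Apple  "))       # apple
print(normalize("U.S.A."))          # usa
print(normalize("apple  banana"))   # apple banana
```

Dedup on `normalize(value)` as the key while keeping the original spelling of the first occurrence, as in the case-insensitive example above.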

7. Post-Dedup Verification

Dedup operations are irreversible (especially when directly deleting), so before proceeding:

  1. Back up your data: Copy the original before dedup
  2. Log the count: Compare row counts before and after to verify results
  3. Spot-check: Randomly sample a few removed records to confirm they were true duplicates
  4. Check business logic: Some "duplicates" may be intentional (e.g., the same customer placing orders at different times) — confirm business rules before dedup

→ Try the Remove Duplicates Tool
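Steps 1 and 2 of the checklist (back up, then log counts) can be folded into a small wrapper; `dedup_with_report` is an illustrative name, and the sketch assumes a flat list of values:

```python
def dedup_with_report(items):
    """Dedup a list while keeping a backup and logging before/after counts."""
    backup = list(items)                 # step 1: copy the original first
    unique = list(dict.fromkeys(items))  # order-preserving dedup
    removed = len(backup) - len(unique)
    print(f"{len(backup)} rows in, {len(unique)} kept, {removed} removed")
    return unique, backup
```

Returning the backup alongside the result makes it easy to spot-check removed records (step 3) before discarding the original.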

FAQ

Does removing duplicates in Excel change the data order?

When using the "Remove Duplicates" feature, Excel keeps the first occurrence and deletes subsequent ones — the order remains unchanged. To keep the last occurrence instead, sort by timestamp in descending order before dedup. The same applies to the UNIQUE function — it returns unique values in order of appearance.

How to handle dedup for large datasets (1M+ rows)?

Excel slows down or crashes with over 1 million rows. Use Python pandas (with chunksize for streaming) or a database (SQL DISTINCT/ROW_NUMBER()). For plain text files, the command line is extremely fast: sort file.txt | uniq > deduped.txt.
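For CSVs too large to load at once, pandas' chunksize parameter streams the file in pieces. One way to keep the first occurrence of each key across chunks (a sketch; the function name and key column are illustrative, and it assumes the set of keys fits in memory):

```python
import pandas as pd

def dedup_large_csv(src, dst, key="email", chunksize=100_000):
    """Stream a large CSV and keep the first occurrence of each key."""
    seen = set()
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        chunk = chunk[~chunk[key].isin(seen)]        # drop keys seen in earlier chunks
        chunk = chunk.drop_duplicates(subset=[key])  # drop dups within this chunk
        seen.update(chunk[key])
        chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
```

Memory usage is bounded by one chunk plus the key set, rather than the whole table.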

Are online dedup tools safe? Will my data be leaked?

It depends on the implementation. Risetop's dedup tool runs entirely in the browser (using JavaScript Set/Map) — your data is never sent to any server. You can verify this by using it offline. For sensitive data (like customer info), we always recommend using local tools or programmatic approaches.

Text deduplication is one of the most fundamental and important operations in data cleaning. Mastering multiple methods lets you handle different work scenarios with ease. For simple list dedup, online tools are most convenient; for structured tabular data, Excel/Sheets built-in features work well; for large-scale or complex needs, Python is the most powerful choice.