How to Remove HTML Tags from Text: Complete Guide

Whether you're scraping web content, cleaning up data for analysis, or extracting readable text from an HTML email, stripping HTML tags is a common task that comes up in web development, data science, and content management. This guide covers every major method for removing HTML tags from text — from simple online tools to programming approaches — and includes advanced techniques for preserving links, formatting, and structure.

Why Remove HTML Tags?

HTML tags define the structure and presentation of web content, but they're noise when you only need the readable text. Common scenarios include extracting article text from scraped web pages, converting HTML email to plain text, preparing content for text analysis or search indexing, and generating plain-text previews or summaries.

Method 1: Use an Online HTML to Text Converter

The fastest way to strip HTML tags is to use our free HTML to Text Converter. Paste your HTML code, click convert, and get clean plain text instantly. The tool handles nested tags, HTML entities, and script/style content removal automatically.

Method 2: Regular Expressions

Regular expressions offer a quick way to strip tags, though they have limitations with malformed HTML or nested structures.

JavaScript

text.replace(/<[^>]*>/g, '')
This removes all content between angle brackets, including the brackets themselves.

Python

import re
re.sub(r'<[^>]+>', '', html_text)

Same logic as JavaScript — matches any tag pattern and replaces with empty string.

Warning: Regex-based approaches can break with malformed HTML, comments (<!-- ... -->), or <script>/<style> content that you may want to remove entirely (including the content inside the tags).
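If regex is all you have, one mitigation is to delete script/style blocks and comments before stripping the remaining tags. A minimal sketch in Python, using only the standard library (the function name strip_tags_naive is ours):

```python
import re

def strip_tags_naive(html: str) -> str:
    """Strip tags with regex, removing script/style/comment content first.

    A quick sketch for trusted, well-formed input; not a substitute
    for a real HTML parser.
    """
    # Drop <script>/<style> blocks including their inner content.
    html = re.sub(r'<(script|style)\b[^>]*>.*?</\1>', '', html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop HTML comments.
    html = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)
    # Drop all remaining tags.
    return re.sub(r'<[^>]+>', '', html)
```

This still inherits regex's limits with malformed markup, but it avoids the most common surprise: JavaScript and CSS leaking into the "plain text" output.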

Method 3: Using Built-in Browser APIs

Browsers provide a DOM-based approach that's more reliable than regex:

function stripHTML(html) {
  const div = document.createElement('div');
  div.innerHTML = html;
  return div.textContent || div.innerText;
}

This method uses the browser's built-in HTML parser, which correctly handles malformed HTML, decodes entities, and ignores script/style content. It's the recommended approach for client-side JavaScript. One caution: assigning untrusted HTML to innerHTML can trigger resource loads and event handlers (for example, an <img onerror=...> payload) even on a detached element. For untrusted input, prefer new DOMParser().parseFromString(html, 'text/html').body.textContent, which parses the markup into an inert document.

Method 4: Python with BeautifulSoup

For server-side HTML parsing, BeautifulSoup is the gold standard:

from bs4 import BeautifulSoup
text = BeautifulSoup(html, 'html.parser').get_text(separator=' ', strip=True)

The separator parameter adds spaces between elements that would otherwise be concatenated (like "HelloWorld" from adjacent elements), and strip=True removes leading/trailing whitespace.

Method 5: Command Line Tools

On Linux/macOS, you can strip tags using command-line tools:

sed 's/<[^>]*>//g' input.html
Quick and dirty: uses sed regex replacement. Because sed processes input line by line, tags that span multiple lines won't be removed.
python3 -c "from bs4 import BeautifulSoup; import sys; print(BeautifulSoup(sys.stdin.read(),'html.parser').get_text())" < input.html
More robust — uses BeautifulSoup for proper parsing.
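If BeautifulSoup isn't installed, Python's standard-library html.parser module can do the same job without any third-party dependency. A minimal sketch (the TagStripper class name is ours):

```python
from html.parser import HTMLParser
from io import StringIO

class TagStripper(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities automatically
        self.out = StringIO()
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.out.write(data)

def strip_tags(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    return parser.out.getvalue()
```

Because convert_charrefs is enabled, entities like &amp; arrive already decoded, so no separate decoding pass is needed.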

Advanced: Preserving Links

Sometimes you want to strip HTML but keep the URLs from links. This is useful for converting HTML articles to Markdown or plain text with clickable references:

Python (BeautifulSoup)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
  href = a.get('href', '')
  a.replace_with(f'{a.get_text()} [{href}]')
text = soup.get_text(separator=' ', strip=True)

This converts <a href="https://example.com">Click here</a> to Click here [https://example.com].

JavaScript

html.replace(/<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/gi, '$2 [$1]')
Simple regex approach — less robust than DOM-based methods but works for well-formed HTML.

Advanced: Preserving Formatting

If you need to preserve formatting hints, such as line breaks, list bullets, headings, and emphasis, while stripping tags, hand-rolled replacements get tedious quickly.

Libraries like html2text (Python) or turndown (JavaScript) handle these conversions automatically. They produce Markdown output that preserves the document's structure without HTML tags.
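For illustration only, here is a deliberately simplistic regex-based sketch of that kind of conversion (our own toy rules, not html2text's actual algorithm); a real library handles far more cases:

```python
import re

def html_to_rough_markdown(html: str) -> str:
    """Very rough formatting-preserving strip; a sketch, not a
    replacement for html2text or turndown."""
    rules = [
        (r'<br\s*/?>', '\n'),        # line breaks become newlines
        (r'</p>\s*', '\n\n'),        # paragraph gaps
        (r'<li[^>]*>', '- '),        # list items become bullets
        (r'</li>', '\n'),
        (r'</?(strong|b)>', '**'),   # bold markers
        (r'</?(em|i)>', '*'),        # italic markers
    ]
    for pattern, repl in rules:
        html = re.sub(pattern, repl, html, flags=re.IGNORECASE)
    # Strip whatever tags remain, then trim surrounding whitespace.
    return re.sub(r'<[^>]+>', '', html).strip()
```

Even this small example shows why a dedicated converter is worth using: nesting, attributes, tables, and links all need rules of their own.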

Handling HTML Entities

When stripping HTML, you'll often encounter entities like &amp;, &lt;, &nbsp;, and &copy;. These need to be decoded to their character equivalents.
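In Python, the standard library's html.unescape handles both named and numeric entities:

```python
import html

# Decode named and numeric entities to their character equivalents.
decoded = html.unescape('Fish &amp; Chips &lt;fresh&gt; &copy; 2024')
print(decoded)  # Fish & Chips <fresh> © 2024
```

Note that &nbsp; decodes to a non-breaking space (U+00A0), not a regular space, which can matter if you later split on whitespace.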

Common Pitfalls

A few recurring problems, all covered in more detail above: regex patterns silently leaving <script> and <style> content in the output; words running together ("HelloWorld") when tags separate text without whitespace; entities like &amp; surviving undecoded; and malformed or nested markup breaking naive tag-matching patterns.

Related Tools

Tools mentioned in this guide: the HTML to Text Converter for one-click tag stripping, the HTML Formatter for inspecting and selectively cleaning markup, and the Markdown to HTML tool for the reverse conversion.

FAQ

How do I remove HTML tags from text?

The easiest way is to use our free HTML to Text Converter — paste your HTML and get clean text instantly. For programming, use JavaScript's DOM API (element.textContent), Python's BeautifulSoup (get_text()), or a regex like <[^>]*> for simple cases.

Can I remove HTML tags but keep the links?

Yes. Using BeautifulSoup in Python, you can iterate through all <a> tags, extract the URL from the href attribute, and replace the tag with text like link text [url]. This preserves hyperlink information while removing all HTML markup.

Is it safe to use regex for stripping HTML tags?

Regex works for simple, well-formed HTML but can fail with malformed markup, comments, script tags, or nested structures. For reliable results, use a proper HTML parser like BeautifulSoup (Python), DOMParser (JavaScript), or html2text. Reserve regex for quick-and-dirty tasks where edge cases don't matter.

How do I remove HTML tags in Excel?

In Excel, you can use a formula with multiple SUBSTITUTE calls, but this only handles known tags. A more practical approach is to use Power Query's "Extract Values" feature after parsing the HTML, or copy the HTML into our online converter and paste the result back.

What happens to HTML entities when stripping tags?

HTML entities like &amp; (ampersand), &lt; (less-than sign), and &nbsp; (non-breaking space) remain as entity codes unless explicitly decoded. Our converter and proper parsers (BeautifulSoup, DOM API) decode entities automatically. If using regex, you'll need a separate entity decoding step.

How do I convert HTML to Markdown?

Use a library like turndown.js (JavaScript) or html2text (Python). These tools parse HTML structure and convert headings, lists, links, bold/italic, and other elements to their Markdown equivalents. Alternatively, you can use our Markdown to HTML tool for the reverse conversion.

Can I remove specific tags only?

Yes. In BeautifulSoup: for tag in soup.find_all('span'): tag.unwrap() removes only <span> tags while keeping their content. In regex, target specific tags: re.sub(r'</?span[^>]*>', '', html). Our HTML Formatter can also help inspect and selectively clean HTML.

Why does my stripped text have missing spaces?

This happens when text nodes are separated only by HTML tags without whitespace. For example, <div>Hello</div><div>World</div> becomes "HelloWorld" because there's no space between the tags. Fix this by using a parser that supports separators (BeautifulSoup's get_text(separator=' ')) or by adding a space after each closing tag during replacement.
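With the regex approach from Method 2, the same fix looks like this: replace each tag with a space instead of nothing, then collapse the extra whitespace:

```python
import re

html = '<div>Hello</div><div>World</div>'
naive = re.sub(r'<[^>]+>', '', html)         # tags removed outright: 'HelloWorld'
spaced = re.sub(r'<[^>]+>', ' ', html)       # each tag becomes a space
clean = re.sub(r'\s+', ' ', spaced).strip()  # collapse runs of whitespace
print(clean)  # Hello World
```

The whitespace-collapsing pass also normalizes newlines and tabs left over from the original markup's indentation.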