How to Remove HTML Tags from Text: Complete Guide
Whether you're scraping web content, cleaning up data for analysis, or extracting readable text from an HTML email, stripping HTML tags is a common task that comes up in web development, data science, and content management. This guide covers every major method for removing HTML tags from text — from simple online tools to programming approaches — and includes advanced techniques for preserving links, formatting, and structure.
Why Remove HTML Tags?
HTML tags define the structure and presentation of web content, but they're noise when you only need the readable text. Common scenarios include:
- Web scraping: Extracting article content from fetched HTML pages
- Email processing: Converting HTML emails to plain text for display or analysis
- Content migration: Moving content between CMS platforms
- Search indexing: Building search indexes from HTML documents
- Data cleaning: Preparing text data for NLP and machine learning
- Accessibility: Generating text alternatives from rich content
Method 1: Use an Online HTML to Text Converter
The fastest way to strip HTML tags is to use our free HTML to Text Converter. Paste your HTML code, click convert, and get clean plain text instantly. The tool handles nested tags, HTML entities, and script/style content removal automatically.
Method 2: Regular Expressions
Regular expressions offer a quick way to strip tags, though they have limitations with malformed HTML or nested structures.
JavaScript
text.replace(/<[^>]*>/g, '')
This removes all content between angle brackets, including the brackets themselves.
Python
import re
re.sub(r'<[^>]+>', '', html_text)
Same logic as JavaScript — matches any tag pattern and replaces with empty string.
Warning: Regex-based approaches can break with malformed HTML, comments (<!-- ... -->), or <script>/<style> content that you may want to remove entirely (including the content inside the tags).
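The cases named in the warning can be partly handled by removing script/style blocks and comments before stripping the remaining tags. The following is a minimal sketch, not a robust sanitizer: the function name strip_tags_regex and the three patterns are illustrative assumptions, and they expect reasonably well-formed markup.

```python
import html
import re

# Illustrative patterns; they assume reasonably well-formed markup.
SCRIPT_STYLE = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>', re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
TAG = re.compile(r'<[^>]+>')


def strip_tags_regex(markup: str) -> str:
    """Drop script/style blocks and comments, then strip the remaining tags."""
    without_blocks = SCRIPT_STYLE.sub('', markup)
    without_comments = COMMENT.sub('', without_blocks)
    without_tags = TAG.sub('', without_comments)
    # Decode entities such as &amp; only after the tags are gone.
    return html.unescape(without_tags).strip()
```

The order matters: script and style blocks go first so that any tag-like text inside them is not left behind by the generic tag pattern.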
Method 3: Using Built-in Browser APIs
Browsers provide a DOM-based approach that's more reliable than regex:
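function stripHTML(html) {
  const div = document.createElement('div');
  div.innerHTML = html;
  return div.textContent || div.innerText;
}
This method uses the browser's built-in HTML parser, which copes with malformed HTML and decodes entities. Note that textContent still includes the text inside <script> and <style> elements, so remove those elements first if you don't want their contents. For untrusted input, parse with DOMParser instead of assigning to innerHTML: the parsed document is inert, so inline event handlers such as an <img> onerror do not fire. For client-side JavaScript, a DOM-based approach is the recommended one.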
Method 4: Python with BeautifulSoup
For server-side HTML parsing, BeautifulSoup is the gold standard:
from bs4 import BeautifulSoup
text = BeautifulSoup(html, 'html.parser').get_text(separator=' ', strip=True)
The separator parameter inserts a space between text fragments that would otherwise be concatenated (like "HelloWorld" from adjacent elements), and strip=True trims whitespace from each fragment and skips fragments that end up empty.
Method 5: Command Line Tools
On Linux/macOS, you can strip tags using command-line tools:
sed 's/<[^>]*>//g' input.html
Quick and dirty: sed applies the regex one line at a time, so a tag that spans several lines is not removed.
python3 -c "from bs4 import BeautifulSoup; import sys; print(BeautifulSoup(sys.stdin.read(),'html.parser').get_text())" < input.html
More robust — uses BeautifulSoup for proper parsing.
Advanced: Preserving Links
Sometimes you want to strip HTML but keep the URLs from links. This is useful for converting HTML articles to Markdown or plain text with clickable references:
Python (BeautifulSoup)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    href = a.get('href', '')
    # Swap the whole <a> element for "link text [url]"
    a.replace_with(f'{a.get_text()} [{href}]')
# Trim only the final string; strip=True would glue the link to its neighbouring text
text = soup.get_text().strip()
This converts <a href="https://example.com">Click here</a> to Click here [https://example.com].
JavaScript
html.replace(/<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/gi, '$2 [$1]')
Simple regex approach: less robust than DOM-based methods, and it only matches links whose href is the first attribute, is double-quoted, and whose link text contains no nested tags.
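If installing BeautifulSoup is not an option, the same "link text [url]" idea can be sketched with Python's standard-library html.parser. The class name LinkKeepingStripper and the helper strip_keep_links are hypothetical names chosen for this example, and the sketch makes no attempt to handle block-level spacing.

```python
from html.parser import HTMLParser


class LinkKeepingStripper(HTMLParser):
    """Collect text and append each link's href in square brackets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href_stack = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Remember the href (may be None) until the closing </a>.
            self.href_stack.append(dict(attrs).get('href'))

    def handle_endtag(self, tag):
        if tag == 'a' and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(f' [{href}]')

    def handle_data(self, data):
        self.parts.append(data)


def strip_keep_links(markup: str) -> str:
    parser = LinkKeepingStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts).strip()
```

Because the parser reads attributes by name, the href does not have to be the first attribute, and nested tags inside the link text are handled like any other text.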
Advanced: Preserving Formatting
If you need to preserve some formatting hints while stripping tags:
- Convert <h1>–<h6> to Markdown headings (#, ##, etc.)
- Convert <strong>/<b> to **bold**
- Convert <em>/<i> to *italic*
- Convert <ul>/<ol> to Markdown lists
- Convert <br> to newlines
Libraries like html2text (Python) or turndown (JavaScript) handle these conversions automatically. They produce Markdown output that preserves the document's structure without HTML tags.
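To show what such a conversion involves, here is a deliberately small sketch built on the standard-library html.parser. It covers only the tags listed above, it is not how html2text or turndown work internally, and the names MiniMarkdown and html_to_markdown are made up for this example.

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Tiny HTML-to-Markdown sketch for headings, bold, italic, lists and <br>."""

    INLINE = {'strong': '**', 'b': '**', 'em': '*', 'i': '*'}
    HEADINGS = {f'h{n}': '#' * n + ' ' for n in range(1, 7)}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.list_stack = []  # one [kind, item_counter] entry per open list

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append('\n' + self.HEADINGS[tag])
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in ('ul', 'ol'):
            self.list_stack.append([tag, 0])
        elif tag == 'li' and self.list_stack:
            current = self.list_stack[-1]
            current[1] += 1
            marker = '-' if current[0] == 'ul' else f'{current[1]}.'
            self.out.append(f'\n{marker} ')
        elif tag == 'br':
            self.out.append('\n')

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.HEADINGS or tag == 'p':
            self.out.append('\n')
        elif tag in ('ul', 'ol') and self.list_stack:
            self.list_stack.pop()
            self.out.append('\n')

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(markup: str) -> str:
    parser = MiniMarkdown()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.out).strip()
```

For real documents a maintained library is the better choice, since it also deals with nesting, links, code blocks, and escaping of Markdown characters.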
Handling HTML Entities
When stripping HTML, you'll often encounter entities like &amp;, &lt;, &nbsp;, and &copy;. These need to be decoded to their character equivalents:
- JavaScript: Use a temporary textarea element: const el = document.createElement('textarea'); el.innerHTML = text; return el.value;
- Python: import html; html.unescape(text)
- Online tool: Our HTML to Text Converter handles entity decoding automatically
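The Python option can be wrapped in a small helper. This sketch uses only the standard library; the function name decode_entities and the normalize_nbsp flag are assumptions added for illustration, the latter because a decoded &nbsp; is the character U+00A0 rather than an ordinary space.

```python
import html


def decode_entities(text: str, normalize_nbsp: bool = True) -> str:
    """Decode named and numeric HTML entities in already tag-free text."""
    decoded = html.unescape(text)
    if normalize_nbsp:
        # &nbsp; decodes to U+00A0; swap it for a regular space if desired.
        decoded = decoded.replace('\u00a0', ' ')
    return decoded
```

Decoding happens in a single pass, so double-encoded input such as &amp;lt; comes out as &lt; rather than <.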
Common Pitfalls
- Script and style content: Removing the tags but not the content leaves JavaScript and CSS code in your text. Always remove <script> and <style> blocks entirely.
- Adjacent elements: <span>Hello</span><span>World</span> becomes "HelloWorld" without a separator. Use get_text(separator=' ') in BeautifulSoup.
- Malformed HTML: Regex-based methods can fail with unclosed tags, nested quotes, or HTML inside attribute values. Use a proper parser when reliability matters.
- Preserving line breaks: <br> and <p> tags create visual breaks. Without handling them, all text runs together into a single paragraph.
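Three of these pitfalls (script/style content, adjacent elements, and line breaks) can be addressed together in one parser pass. The following is a sketch using the standard-library html.parser; the names TextExtractor and extract_text and the choice of which tags count as block-level are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Skip script/style, break lines at block tags, add spaces at inline tags."""

    SKIP = {'script', 'style'}
    BLOCK = {'p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        else:
            self.parts.append('\n' if tag in self.BLOCK else ' ')

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        else:
            self.parts.append('\n' if tag in self.BLOCK else ' ')

    def handle_data(self, data):
        if not self.skip_depth:
            # Source newlines are plain whitespace in HTML, so collapse them.
            self.parts.append(re.sub(r'\s+', ' ', data))


def extract_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    raw = ''.join(parser.parts)
    # Collapse repeated spaces and drop lines that ended up empty.
    lines = (re.sub(r' +', ' ', line).strip() for line in raw.split('\n'))
    return '\n'.join(line for line in lines if line)
```

Adding a space at every inline tag boundary mirrors the separator=' ' behaviour described above; the trade-off is that markup inside a word, such as <b>bo</b>ld, comes out as "bo ld".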
Related Tools
- HTML Formatter: Beautify and format HTML code with proper indentation
- Markdown to HTML: Convert Markdown text to HTML — the reverse process
- HTML to Text Converter: Our free online tool for instant tag stripping
FAQ
How do I remove HTML tags from text?
The easiest way is to use our free HTML to Text Converter — paste your HTML and get clean text instantly. For programming, use JavaScript's DOM API (element.textContent), Python's BeautifulSoup (get_text()), or a regex like <[^>]*> for simple cases.
Can I remove HTML tags but keep the links?
Yes. Using BeautifulSoup in Python, you can iterate through all <a> tags, extract the URL from the href attribute, and replace the tag with text like link text [url]. This preserves hyperlink information while removing all HTML markup.
Is it safe to use regex for stripping HTML tags?
Regex works for simple, well-formed HTML but can fail with malformed markup, comments, script tags, or nested structures. For reliable results, use a proper HTML parser like BeautifulSoup (Python), DOMParser (JavaScript), or html2text. Reserve regex for quick-and-dirty tasks where edge cases don't matter.
How do I remove HTML tags in Excel?
In Excel, you can use a formula with multiple SUBSTITUTE calls, but this only handles known tags. A more practical approach is to use Power Query's "Extract Values" feature after parsing the HTML, or copy the HTML into our online converter and paste the result back.
What happens to HTML entities when stripping tags?
HTML entities like &amp; → &, &lt; → <, and &nbsp; → non-breaking space remain as entity codes unless explicitly decoded. Our converter and proper parsers (BeautifulSoup, DOM API) decode entities automatically. If using regex, you'll need a separate entity decoding step.
How do I convert HTML to Markdown?
Use a library like turndown.js (JavaScript) or html2text (Python). These tools parse HTML structure and convert headings, lists, links, bold/italic, and other elements to their Markdown equivalents. Alternatively, you can use our Markdown to HTML tool for the reverse conversion.
Can I remove specific tags only?
Yes. In BeautifulSoup: for tag in soup.find_all('span'): tag.unwrap() removes only <span> tags while keeping their content. In regex, target specific tags: re.sub(r'<\/?span[^>]*>', '', html). Our HTML Formatter can also help inspect and selectively clean HTML.
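The regex variant can be generalized to any tag name. This is a minimal sketch with a hypothetical helper called unwrap_tags; it adds a lookahead so that a request for span does not also match a longer tag name that merely starts with "span", and like any regex approach it assumes well-formed tags.

```python
import re


def unwrap_tags(markup: str, tag_name: str) -> str:
    """Remove opening and closing `tag_name` tags but keep what is inside them."""
    # The lookahead requires whitespace, '/' or '>' right after the tag name.
    pattern = re.compile(rf'</?{re.escape(tag_name)}(?=[\s/>])[^>]*>', re.IGNORECASE)
    return pattern.sub('', markup)
```

All other tags are left untouched, so the result is still HTML and can be passed on to a parser or formatter.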
Why does my stripped text have missing spaces?
This happens when text nodes are separated only by HTML tags without whitespace. For example, <div>Hello</div><div>World</div> becomes "HelloWorld" because there's no space between the tags. Fix this by using a parser that supports separators (BeautifulSoup's get_text(separator=' ')) or by adding a space after each closing tag during replacement.