
Input Sanitization: DOMPurify vs Manual Validation

Learn when to use DOMPurify versus manual validation for input sanitization. Real-world examples comparing DOM-based sanitization with regex and custom validation approaches.

While looking over some legacy code in a project I inherited the other day, I came across something that made me cringe. The previous developer had built an entire comment system where user input was validated with a handful of regex patterns before being rendered to the page. As it turned out, this approach was riddled with vulnerabilities that could have exposed the entire application to XSS attacks.

I was once guilty of thinking that a few well-placed regex patterns were enough to keep malicious input at bay. I would strip out script tags, remove angle brackets, and call it a day. Looking back, I realize how naive that approach was. The world of web security is far more nuanced than simple pattern matching can handle.

Why Input Sanitization Matters in Modern Web Applications

Here's the reality: every single point where user input touches your application is a potential security vulnerability. Whether you're building a comment system, a rich text editor, or even a simple search form, you need to treat user input as hostile until proven otherwise.

I cannot stress this enough! Modern web applications are incredibly complex, and attackers have become increasingly sophisticated. They're not just injecting <script>alert('xss')</script> anymore. They're using obscure HTML attributes, SVG elements, and even CSS injection techniques to bypass naive sanitization attempts.

The financial and reputational cost of a security breach is astronomical. When I finally decided to dive deep into input sanitization, I realized that the question isn't whether you should sanitize input, but how you should approach it effectively.

Understanding the Difference: Validation vs Sanitization

Before we jump into comparing specific approaches, let's clarify something fundamental that confused me for years: validation and sanitization are not the same thing.

Validation is about rejection. You're checking if input meets specific criteria, and if it doesn't, you reject it entirely. Think of it like a bouncer at a club checking IDs. If your ID doesn't match the requirements, you're not getting in.

Sanitization is about transformation. You're taking potentially dangerous input and making it safe by removing or encoding the harmful parts while preserving the legitimate content. It's more like a security checkpoint at an airport where they remove dangerous items but let you through.

In other words, validation says "no" to bad input, while sanitization says "yes, but I'm going to clean this up first."
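To make the contrast concrete, here's a minimal sketch in TypeScript. The helper names are my own, not from any library: validation rejects input that fails a rule, while sanitization transforms it into something safe to render.

```typescript
// Validation: the bouncer. Input either matches the rule or it's rejected.
function isValidUsername(input: string): boolean {
  return /^[a-zA-Z0-9_]{3,20}$/.test(input);
}

// Sanitization: the checkpoint. Dangerous characters are transformed,
// but the content itself is allowed through.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

isValidUsername('<script>bob</script>'); // false: turned away at the door
escapeHtml('<script>bob</script>');      // "&lt;script&gt;bob&lt;/script&gt;": cleaned and let through
```

The validator never tries to "fix" a bad username; the escaper never rejects anything. Keeping those two jobs separate is the whole point of the distinction.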


DOMPurify: DOM-Based HTML Sanitization Explained

When I came across DOMPurify for the first time, I was skeptical. Another library promising to solve all my security problems? I'd seen plenty of those fail spectacularly. But DOMPurify is different, and here's why.

DOMPurify doesn't use regex or string manipulation to clean HTML. Instead, it leverages the browser's own HTML parser to create a DOM tree, then walks through that tree to remove dangerous elements and attributes. This is fascinating because it means DOMPurify understands HTML the same way browsers do, making it incredibly difficult to bypass with clever encoding tricks.
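Here's a toy sketch of that tree-walking idea, purely to illustrate the concept. This is nothing like DOMPurify's actual implementation, and every name in it is made up:

```typescript
// Toy whitelist walker: keep allowed elements and attributes, drop everything else.
interface TreeNode {
  tag: string;
  attrs: Record<string, string>;
  children: TreeNode[];
  text?: string;
}

const ALLOWED_TAGS = new Set(['div', 'p', 'a', 'img', '#text']);
const ALLOWED_ATTRS = new Set(['href', 'src']);

function walk(node: TreeNode): TreeNode | null {
  // Drop the whole subtree for disallowed elements like <script> or <iframe>
  if (!ALLOWED_TAGS.has(node.tag)) return null;

  // Keep only whitelisted attributes, and reject javascript: URLs
  const attrs: Record<string, string> = {};
  for (const [name, value] of Object.entries(node.attrs)) {
    if (ALLOWED_ATTRS.has(name.toLowerCase()) && !/^javascript:/i.test(value.trim())) {
      attrs[name] = value;
    }
  }

  // Recurse into children, discarding anything that was dropped
  const children = node.children
    .map(walk)
    .filter((c): c is TreeNode => c !== null);

  return { ...node, attrs, children };
}
```

Because the walk operates on a parsed tree rather than raw text, encoding tricks that fool string matching never even reach the filter: the parser has already resolved them into plain nodes and attributes.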

The library was designed by security researchers who understand the nuances of XSS attacks. It handles edge cases that would take you years to discover on your own: mutation XSS, DOM clobbering, dangling markup injection, and countless other attack vectors I'd never even heard of when I started.

Here's a basic example of DOMPurify in action:

import DOMPurify from 'dompurify';
 
// Dangerous user input
const userInput = `
  <div>
    Hello World!
    <img src=x onerror="alert('XSS')">
    <script>alert('More XSS')</script>
    <a href="javascript:alert('Even more XSS')">Click me</a>
  </div>
`;
 
// Sanitize it
const cleanHTML = DOMPurify.sanitize(userInput);
 
// Result (whitespace collapsed for readability):
// '<div>Hello World! <img src="x"> <a>Click me</a></div>'
// All dangerous elements and attributes removed, safe content preserved

Wonderful! Notice how DOMPurify preserved the legitimate HTML structure while completely removing the dangerous parts. It didn't just strip out script tags; it also removed the onerror attribute and the javascript: protocol from the link.

Manual Validation Approaches: Regex, Whitelists, and Custom Logic

Luckily we can learn from the mistakes I made with manual validation. Let me show you what I was doing wrong, and why it's so problematic.

Here's the kind of code I used to write:

function sanitizeInput(input: string): string {
  // Attempt 1: Remove script tags
  let cleaned = input.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
  
  // Attempt 2: Remove quoted event handlers
  cleaned = cleaned.replace(/on\w+\s*=\s*["'][^"']*["']/gi, '');
  
  // Attempt 3: Remove the javascript: protocol
  cleaned = cleaned.replace(/javascript:/gi, '');
  
  // Attempt 4: Encoding every angle bracket would neutralize what's left,
  // but it would also destroy the legitimate HTML we wanted to preserve:
  // cleaned = cleaned.replace(/</g, '&lt;').replace(/>/g, '&gt;');
  
  return cleaned;
}
 
// Test with malicious input
const maliciousInput = `<img src=x onerror="alert('XSS')">`;
const result = sanitizeInput(maliciousInput); // Handler stripped, so it looks safe...
 
// ...but each pattern has bypasses:
const bypass1 = `<img src=x onerror=alert('XSS')>`; // No quotes, so attempt 2 never matches
const bypass2 = `<scr<script></script>ipt>alert('XSS')</scr<script></script>ipt>`; // Single-pass removal reassembles <script>
const bypass3 = `<svg/onload=alert('XSS')>`; // SVG element, again with no quotes
const bypass4 = `<a href="java\tscript:alert('XSS')">Click</a>`; // Tab inside the protocol defeats attempt 3

The problem with this approach is that I was playing whack-a-mole with attack vectors. Every time I learned about a new XSS technique, I added another regex pattern. The code became increasingly complex and impossible to maintain, and I could never be confident that I'd covered all the edge cases.

Manual validation works better when you're dealing with structured, predictable input:

interface UserProfile {
  username: string;
  email: string;
  age: number;
}
 
function validateUserProfile(input: unknown): UserProfile | null {
  // This is appropriate for structured data
  if (typeof input !== 'object' || input === null) return null;
  
  const data = input as Record<string, unknown>;
  
  // Username: alphanumeric only, 3-20 characters
  const usernameRegex = /^[a-zA-Z0-9]{3,20}$/;
  if (typeof data.username !== 'string' || !usernameRegex.test(data.username)) {
    return null;
  }
  
  // Email: basic format check
  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  if (typeof data.email !== 'string' || !emailRegex.test(data.email)) {
    return null;
  }
  
  // Age: must be a number between 13 and 120
  if (typeof data.age !== 'number' || data.age < 13 || data.age > 120) {
    return null;
  }
  
  return {
    username: data.username,
    email: data.email,
    age: data.age
  };
}

This validation approach makes sense because we're dealing with simple, structured data where the expected format is well-defined. We're not trying to preserve complex HTML structures—we're simply accepting or rejecting specific patterns.


DOMPurify vs Manual Validation: A Side-by-Side Comparison

After working with both approaches extensively, here's what I've learned about when to use each:

DOMPurify wins when:

  • You need to accept rich HTML content (comments, blog posts, messaging)
  • You're dealing with user-generated HTML that might contain formatting
  • You want comprehensive protection without maintaining complex regex patterns
  • You need to handle edge cases and obscure attack vectors
  • Your input format is unpredictable or complex

Manual validation wins when:

  • You're working with simple, structured data (usernames, emails, phone numbers)
  • You want to reject input outright rather than transform it
  • Your validation rules are straightforward and well-defined
  • You need fine-grained control over what's acceptable
  • Performance is critical and you're processing massive volumes

The key insight is that these approaches aren't mutually exclusive. In fact, the most robust applications use both.

Implementing DOMPurify in React Applications

When I started using DOMPurify in React applications, I made a crucial mistake: I was re-sanitizing content on every render. This was wasteful, since the same string was cleaned again even when nothing had changed, and it cut against React's conventions for memoizing derived values.

Here's the right way to integrate DOMPurify into a React component:

import { useMemo } from 'react';
import DOMPurify from 'dompurify';
 
interface CommentProps {
  content: string;
  author: string;
}
 
function Comment({ content, author }: CommentProps) {
  // Sanitize once and memoize the result
  const sanitizedContent = useMemo(() => {
    return DOMPurify.sanitize(content, {
      ALLOWED_TAGS: ['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
      ALLOWED_ATTR: ['href'],
    });
  }, [content]);
  
  return (
    <div className="comment">
      <div className="comment-author">{author}</div>
      <div 
        className="comment-content"
        dangerouslySetInnerHTML={{ __html: sanitizedContent }}
      />
    </div>
  );
}

Notice how I'm using useMemo to ensure sanitization only happens when the content actually changes. I'm also configuring DOMPurify with a whitelist of allowed tags and attributes, which provides an additional layer of security by being explicit about what's permitted.

When Manual Validation Makes More Sense

Despite DOMPurify's power, there are scenarios where manual validation is the pragmatic choice. When I was building a form that collected structured data like addresses and phone numbers, I realized that rejecting invalid input was far more appropriate than trying to sanitize it.

Consider a search query input. You don't want to allow HTML at all, you just want plain text. In this case, a simple strip-and-truncate is all you need:

function sanitizeSearchQuery(query: string): string {
  // Remove all HTML tags and trim whitespace
  return query.replace(/<[^>]*>/g, '').trim().slice(0, 200);
}
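For completeness, here's how it behaves in practice (the function is re-declared so the snippet runs on its own):

```typescript
// Re-declared from above so this snippet is self-contained
function sanitizeSearchQuery(query: string): string {
  // Remove all HTML tags, trim whitespace, cap the length
  return query.replace(/<[^>]*>/g, '').trim().slice(0, 200);
}

console.log(sanitizeSearchQuery('  <b>laptops</b> under $500  '));
// → "laptops under $500"
console.log(sanitizeSearchQuery('<img src=x onerror=alert(1)>deals'));
// → "deals"
```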

This is fast, predictable, and doesn't require any external dependencies. The ROI on learning and implementing DOMPurify for this use case would be negative.

Building a Hybrid Sanitization Strategy

The approach I've settled on after years of trial and error is using the right tool for the right job. Here's my framework:

For rich content (comments, posts, descriptions): Use DOMPurify with strict configuration. Accept that users want formatting, but keep them in a safe sandbox.

For structured data (forms, profiles, settings): Use validation to reject malformed input. Be strict about what you accept and reject everything else.

For search and simple text: Strip all markup and enforce length limits. Don't overcomplicate it.

For API inputs: Validate structure and types at the API boundary, then sanitize rich content fields before storage or display.
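As a sketch of how these pieces fit together, here's one way to combine strict validation for structured fields with sanitization for rich ones. The `CommentInput` shape and the injected `sanitizeHtml` parameter are my own illustration, not a prescribed API; in the browser you'd pass something like `(html) => DOMPurify.sanitize(html)`:

```typescript
interface CommentInput {
  authorName: string; // structured field: validate and reject
  body: string;       // rich content: sanitize and keep
}

function processComment(
  input: unknown,
  sanitizeHtml: (html: string) => string // injected HTML sanitizer
): CommentInput | null {
  if (typeof input !== 'object' || input === null) return null;
  const data = input as Record<string, unknown>;

  // Structured field: strict validation, reject on any failure
  if (
    typeof data.authorName !== 'string' ||
    !/^[a-zA-Z0-9 ]{1,50}$/.test(data.authorName)
  ) {
    return null;
  }

  // Rich content field: sanitize rather than reject
  if (typeof data.body !== 'string') return null;

  return { authorName: data.authorName, body: sanitizeHtml(data.body) };
}
```

Notice that each field gets the protection that fits it: a malformed author name kills the whole request, while markup in the body is quietly cleaned instead of bouncing the user.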

This hybrid approach has saved me countless hours of debugging and has dramatically improved the security posture of every application I've worked on. The key is understanding that different types of input require different types of protection.

And that concludes this post! I hope you found it valuable, and look out for more in the future!