Character Encoding for Web Developers
A comprehensive guide to character encoding in web development. Learn why "café" becomes "cafÃ©", how to fix encoding issues, and master UTF-8, URL encoding, Base64, and HTML entities.
What is Character Encoding?
Character encoding is how computers translate characters (letters, symbols, emojis) into bytes. Different encoding systems use different translation rules, which is why you sometimes see "garbage" characters like � or é when systems disagree.
// The word "café" in different encodings:
UTF-8: 63 61 66 C3 A9 (5 bytes)
ISO-8859-1: 63 61 66 E9 (4 bytes)
Windows-1252: 63 61 66 E9 (4 bytes)
// Same bytes, wrong interpretation = mojibake (garbled text)
UTF-8 "café" read as ISO-8859-1 → "café"UTF-8: The Universal Standard
UTF-8 (Unicode Transformation Format - 8-bit) is the dominant character encoding on the web. Over 98% of websites use UTF-8 because it:
- Supports all languages: English, Chinese, Arabic, emoji, mathematical symbols - everything
- Backward compatible with ASCII: ASCII characters (A-Z, 0-9) use 1 byte in both
- Variable length: 1-4 bytes per character (efficient for English, works for all scripts)
- Self-synchronizing: You can detect character boundaries even if you start mid-stream
UTF-8 in Practice
// HTML - Always declare UTF-8
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Your Page</title>
</head>
// JavaScript - Files should be saved as UTF-8
const greeting = "Hello 世界 🌍" // Works perfectly in UTF-8
// Node.js - Default is UTF-8 (be explicit if needed)
const fs = require('fs')
fs.writeFileSync('file.txt', 'café', 'utf8')
// Database - MySQL example
CREATE TABLE users (
name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
// Note: utf8mb4 is MySQL's "real" UTF-8 (supports 4-byte characters like emoji)
// Always use utf8mb4, not the legacy 3-byte "utf8", for full UTF-8 support including emoji.
Common Encoding Issues and Fixes
1. Mojibake: "café" Becomes "cafÃ©"
Cause: Data stored as UTF-8 but read as ISO-8859-1 (Latin-1) or vice versa.
// Problem: Database stores UTF-8, but connection uses Latin-1
// Fix: Set connection charset
const mysql = require('mysql2')
const connection = mysql.createConnection({
charset: 'utf8mb4' // Always specify UTF-8
})
// PHP Fix
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4');
// Python Fix
import pymysql
conn = pymysql.connect(charset='utf8mb4')
2. Question Mark Diamonds: � (Replacement Character)
Cause: System encounters bytes it can't decode. Unicode replacement character U+FFFD appears.
// Typical scenario: copying text from Word into a Latin-1 form
// Smart quotes “ ” (UTF-8) → � � (Latin-1 can't handle them)
// Fix 1: Ensure form accepts UTF-8
<form accept-charset="UTF-8">
// Fix 2: Sanitize input (convert smart quotes to regular quotes)
const sanitize = (text) => {
return text
.replace(/[“”]/g, '"') // Smart double quotes
.replace(/[‘’]/g, "'") // Smart single quotes
.replace(/–/g, '-') // En dash
.replace(/—/g, '--') // Em dash
}
3. URL Encoding Errors
Cause: Special characters in URLs not properly encoded.
// ❌ WRONG - Breaks when special characters present
const url = 'https://api.example.com/search?q=' + userInput
// ✓ CORRECT - Always encode URL components
const url = 'https://api.example.com/search?q=' + encodeURIComponent(userInput)
// Examples:
encodeURIComponent("Hello World") → "Hello%20World"
encodeURIComponent("user@email.com") → "user%40email.com"
encodeURIComponent("price=$100") → "price%3D%24100"
// Use encodeURI for full URLs (preserves :// and /)
encodeURI("https://example.com/path with spaces")
→ "https://example.com/path%20with%20spaces"HTML Entity Encoding
HTML has reserved characters (<, >, &, ") that must be encoded to display literally or to prevent XSS attacks when showing user input.
Named vs Numeric Entities
// Named entities (easier to read)
< → &lt;
> → &gt;
& → &amp;
" → &quot;
' → &#39; (or &apos;)
// Numeric entities (works for any Unicode character)
© → &#169; (decimal)
© → &#xA9; (hexadecimal)
🙂 → &#128578;
// Example: Display HTML code as text
<pre>&lt;div class="example"&gt;Content&lt;/div&gt;</pre>
XSS Prevention with HTML Encoding
// ❌ DANGEROUS - Never put user input directly in HTML
const comment = req.body.comment
html = `<div>${comment}</div>`
// ✓ SAFE - Encode HTML entities
function escapeHtml(text) {
return text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;')
}
const safeComment = escapeHtml(req.body.comment)
html = `<div>${safeComment}</div>`
// Or use a library
import { escape } from 'lodash'
import DOMPurify from 'dompurify'
// React does this automatically for JSX text content (but not dangerouslySetInnerHTML)
// Example: <script>alert('XSS')</script> must become &lt;script&gt;... to prevent script execution.
Base64 Encoding
Base64 converts binary data to ASCII text using 64 characters (A-Z, a-z, 0-9, +, /). Used for embedding binary data in text formats like JSON, XML, HTML, or email.
When to Use Base64
- Data URIs: Embed images in CSS/HTML without external files
- JSON APIs: Send binary data (images, PDFs) in JSON
- Email attachments: MIME encoding (legacy but still used)
- Storing binary in databases: When BLOB columns aren't available
// JavaScript Base64 encoding/decoding
const text = "Hello World"
const encoded = btoa(text) // "SGVsbG8gV29ybGQ="
const decoded = atob(encoded) // "Hello World"
// For Unicode text, a legacy workaround (uses deprecated escape/unescape):
const encodeUTF8 = (str) => btoa(unescape(encodeURIComponent(str)))
const decodeUTF8 = (str) => decodeURIComponent(escape(atob(str)))
// Modern browsers support TextEncoder (better):
const encodeUTF8Modern = (str) => {
const bytes = new TextEncoder().encode(str)
return btoa(String.fromCharCode(...bytes))
}
// Data URI example (embed image in HTML)
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA..." />
// Node.js Base64
const base64 = Buffer.from("Hello World").toString('base64')
const text = Buffer.from(base64, 'base64').toString('utf8')
URL Encoding (Percent Encoding)
URLs can only contain certain "safe" ASCII characters. Special characters must be percent-encoded as %XX where XX is the hexadecimal byte value.
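Because the %XX escapes operate on bytes, a non-ASCII character becomes one escape per UTF-8 byte. A quick sketch:

```javascript
// Multi-byte UTF-8 characters become multiple %XX escapes
console.log(encodeURIComponent('é'))  // "%C3%A9" (2 UTF-8 bytes)
console.log(encodeURIComponent('🌍')) // "%F0%9F%8C%8D" (4 UTF-8 bytes)

// Decoding reassembles the bytes back into characters
console.log(decodeURIComponent('%C3%A9')) // "é"
```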
Safe vs Unsafe Characters
// Safe characters (no encoding needed):
A-Z a-z 0-9 - _ . ~
// Must be encoded:
Space → %20 (or + in query strings)
! → %21
" → %22
# → %23 (fragment identifier)
$ → %24
% → %25 (escape character itself)
& → %26 (query parameter separator)
' → %27
( → %28
) → %29
= → %3D (key-value separator)
? → %3F (query string start)
@ → %40
[ → %5B
] → %5D
{ → %7B
} → %7D
JavaScript URL Encoding Functions
// encodeURIComponent - Use for query parameters, form data
encodeURIComponent("Hello World!") → "Hello%20World%21"
encodeURIComponent("user@email.com") → "user%40email.com"
// Use case: Building query strings
const params = new URLSearchParams({
search: "Node.js & React",
category: "Web Development"
}).toString()
// → "search=Node.js+%26+React&category=Web+Development"
// encodeURI - Use for full URLs (preserves :, /, ?, #)
encodeURI("https://example.com/search?q=hello world")
// → "https://example.com/search?q=hello%20world"
// ❌ Never use escape() - Deprecated, doesn't handle Unicode properly
Common Encoding Scenarios in Web Development
1. Sending Form Data with Special Characters
// URL-encoded form submission (application/x-www-form-urlencoded)
const formData = {
username: "user@example.com",
bio: "Web developer & designer 🚀"
}
const encoded = new URLSearchParams(formData).toString()
// → "username=user%40example.com&bio=Web+developer+%26+designer+%F0%9F%9A%80"
// Sending to API
fetch('/api/profile', {
method: 'POST',
headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
body: encoded
})
2. Storing User-Generated Content
// Database: Store as UTF-8
// Display in HTML: Escape HTML entities
// Send in JSON: UTF-8 is valid in JSON strings
// Example: Blog comment system
// 1. Receive from user
const comment = req.body.comment
// 2. Store in database (UTF-8)
await db.query('INSERT INTO comments (text) VALUES (?)', [comment])
// 3. Retrieve from database
const rows = await db.query('SELECT text FROM comments')
// 4. Display in HTML (escape HTML entities)
const safeHtml = escapeHtml(rows[0].text)
res.send(`<div class="comment">${safeHtml}</div>`)
// 5. Send in JSON API (UTF-8 is fine)
res.json({ comment: rows[0].text })
3. Working with Filenames
// User uploads "résumé-2024.pdf"
// Store on disk: URL-encode or use safe filesystem names
const sanitizeFilename = (name) => {
return name
.normalize('NFD') // Decompose accented characters
.replace(/[\u0300-\u036f]/g, '') // Remove diacritics (combining marks)
.replace(/[^a-zA-Z0-9.-]/g, '_') // Replace unsafe chars
}
sanitizeFilename("résumé-2024.pdf") → "resume-2024.pdf"
// Or preserve Unicode with proper encoding
const safeFilename = encodeURIComponent(filename)
// Download header (Content-Disposition)
res.setHeader('Content-Disposition',
'attachment; filename="' + sanitizeFilename(name) + '"; ' +
'filename*=UTF-8\'\'' + encodeURIComponent(name))
// RFC 5987: filename* supports UTF-8 filenames
Debugging Encoding Issues
Tools and Techniques
// 1. Check byte representation
const text = "café"
console.log([...text].map(c => c.charCodeAt(0).toString(16)))
// → ['63', '61', '66', 'e9'] - Unicode code points for "café", not UTF-8 bytes
// For the actual UTF-8 bytes, use TextEncoder:
console.log([...new TextEncoder().encode(text)].map(b => b.toString(16)))
// → ['63', '61', '66', 'c3', 'a9']
// 2. Detect encoding (Node.js)
const chardet = require('chardet')
const encoding = chardet.detectFileSync('mystery.txt')
console.log(encoding) // e.g., "UTF-8" or "ISO-8859-1"
// 3. Convert encoding (Node.js)
const iconv = require('iconv-lite')
const latin1Buffer = Buffer.from([0x63, 0x61, 0x66, 0xE9]) // "café" in Latin-1
const utf8Text = iconv.decode(latin1Buffer, 'latin1')
console.log(utf8Text) // "café" (correctly decoded)
// 4. Check HTTP headers
// Browser DevTools → Network → Response Headers
// Look for: Content-Type: text/html; charset=UTF-8
Common Red Flags
- Seeing � (replacement character) = Encoding mismatch or corrupted data
- "café" instead of "café" = UTF-8 data read as Latin-1
- "caf茅" = Double-encoded UTF-8 (encoded UTF-8 re-encoded as UTF-8)
- Question marks in database but correct in application = DB charset wrong
Best Practices Checklist
- ✓ Use UTF-8 everywhere (HTML, database, files, APIs)
- ✓ Declare charset in HTML: <meta charset="UTF-8">
- ✓ Save source files as UTF-8 (not UTF-8 with BOM)
- ✓ Test with non-ASCII characters early (café, 日本語, emoji 🎉)
- ✓ MySQL: Use utf8mb4, not utf8
- ✓ PostgreSQL: Default UTF-8 encoding is good (check with SHOW SERVER_ENCODING)
- ✓ Set connection charset explicitly in code
- ✓ Use COLLATE utf8mb4_unicode_ci for case-insensitive sorting
- ✓ Use encodeURIComponent() for query parameters
- ✓ Use encodeURI() for full URLs
- ✓ Add accept-charset="UTF-8" to forms
- ✓ Validate and sanitize user input before display
Summary
- UTF-8 is the standard - Use it for everything unless you have a specific legacy requirement
- URL encode query parameters - Use encodeURIComponent() for user input in URLs
- HTML encode user content - Prevent XSS by escaping <, >, &, ", '
- Base64 for binary in text - When you need to embed images, files, or binary data in JSON/HTML/CSS
- Watch for mojibake - "cafÃ©" means encoding mismatch; fix database/connection charset
- MySQL: use utf8mb4 - Not "utf8" (which doesn't support emoji)
Proper character encoding prevents data corruption, display issues, and security vulnerabilities. When in doubt: UTF-8 everywhere, encode for context (URLs, HTML, Base64), and test with international characters early in development.