Character Encoding for Web Developers
A comprehensive guide to character encoding in web development. Learn why "café" becomes "cafÃ©", how to fix encoding issues, and master UTF-8, URL encoding, Base64, and HTML entities.
What is Character Encoding?
Character encoding is how computers translate characters (letters, symbols, emojis) into bytes. Different encoding systems use different translation rules, which is why you sometimes see "garbage" characters like � or é when systems disagree.
// The word "café" in different encodings:
UTF-8: 63 61 66 C3 A9 (5 bytes)
ISO-8859-1: 63 61 66 E9 (4 bytes)
Windows-1252: 63 61 66 E9 (4 bytes)
// Same bytes, wrong interpretation = mojibake (garbled text)
UTF-8 "café" read as ISO-8859-1 → "café"UTF-8: The Universal Standard
UTF-8 (Unicode Transformation Format - 8-bit) is the dominant character encoding on the web. Over 98% of websites use UTF-8 because it:
- Supports all languages: English, Chinese, Arabic, emoji, mathematical symbols - everything
- Backward compatible with ASCII: ASCII characters (A-Z, 0-9) use 1 byte in both
- Variable length: 1-4 bytes per character (efficient for English, works for all scripts)
- Self-synchronizing: You can detect character boundaries even if you start mid-stream
UTF-8 in Practice
// HTML - Always declare UTF-8
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Your Page</title>
</head>
// JavaScript - Files should be saved as UTF-8
const greeting = "Hello 世界 🌍" // Works perfectly in UTF-8
// Node.js - Default is UTF-8 (be explicit if needed)
const fs = require('fs')
fs.writeFileSync('file.txt', 'café', 'utf8')
// Database - MySQL example
CREATE TABLE users (
name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);
// Note: utf8mb4 is MySQL's "real" UTF-8 (supports 4-byte characters like emoji)
// Always use utf8mb4, not the legacy 3-byte "utf8", for full UTF-8 support including emoji.
Common Encoding Issues and Fixes
1. Mojibake: "café" Becomes "cafÃ©"
Cause: Data stored as UTF-8 but read as ISO-8859-1 (Latin-1) or vice versa.
// Problem: Database stores UTF-8, but connection uses Latin-1
// Fix: Set connection charset
const mysql = require('mysql2')
const connection = mysql.createConnection({
charset: 'utf8mb4' // Always specify UTF-8
})
// PHP Fix
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4');
// Python Fix
import pymysql
conn = pymysql.connect(charset='utf8mb4')
2. Question Mark Diamonds: � (Replacement Character)
Cause: System encounters bytes it can't decode. Unicode replacement character U+FFFD appears.
// Typical scenario: copying text from Word into a Latin-1 form
// Smart quotes “ ” (UTF-8) → � � (Latin-1 can't handle them)
// Fix 1: Ensure form accepts UTF-8
<form accept-charset="UTF-8">
// Fix 2: Sanitize input (convert smart quotes to regular quotes)
const sanitize = (text) => {
return text
.replace(/[“”]/g, '"') // Smart double quotes
.replace(/[‘’]/g, "'") // Smart single quotes
.replace(/–/g, '-') // En dash
.replace(/—/g, '--') // Em dash
}
3. URL Encoding Errors
Cause: Special characters in URLs not properly encoded.
// ❌ WRONG - Breaks when special characters present
const url = 'https://api.example.com/search?q=' + userInput
// ✓ CORRECT - Always encode URL components
const url = 'https://api.example.com/search?q=' + encodeURIComponent(userInput)
// Examples:
encodeURIComponent("Hello World") → "Hello%20World"
encodeURIComponent("user@email.com") → "user%40email.com"
encodeURIComponent("price=$100") → "price%3D%24100"
// Use encodeURI for full URLs (preserves :// and /)
encodeURI("https://example.com/path with spaces")
→ "https://example.com/path%20with%20spaces"HTML Entity Encoding
HTML has reserved characters (<, >, &, ") that must be encoded to display literally or to prevent XSS attacks when showing user input.
Named vs Numeric Entities
// Named entities (easier to read)
< → &lt;
> → &gt;
& → &amp;
" → &quot;
' → &#39; (or &apos;)
// Numeric entities (works for any Unicode character)
© → &#169; (decimal)
© → &#xA9; (hexadecimal)
🙂 → &#128578;
// Example: Display HTML code as text
<pre>&lt;div class="example"&gt;Content&lt;/div&gt;</pre>
XSS Prevention with HTML Encoding
// ❌ DANGEROUS - Never put user input directly in HTML
const comment = req.body.comment
html = `<div>${comment}</div>`
// ✓ SAFE - Encode HTML entities
function escapeHtml(text) {
return text
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#39;')
}
const safeComment = escapeHtml(req.body.comment)
html = `<div>${safeComment}</div>`
// Or use a library
import { escape } from 'lodash'
import DOMPurify from 'dompurify'
// React does this automatically for JSX text content (but not dangerouslySetInnerHTML)
// Example: <script>alert('XSS')</script> must become &lt;script&gt;... to prevent script execution.
Base64 Encoding
Base64 converts binary data to ASCII text using 64 characters (A-Z, a-z, 0-9, +, /). Used for embedding binary data in text formats like JSON, XML, HTML, or email.
When to Use Base64
- Data URIs: Embed images in CSS/HTML without external files
- JSON APIs: Send binary data (images, PDFs) in JSON
- Email attachments: MIME encoding (legacy but still used)
- Storing binary in databases: When BLOB columns aren't available
// JavaScript Base64 encoding/decoding
const text = "Hello World"
const encoded = btoa(text) // "SGVsbG8gV29ybGQ="
const decoded = atob(encoded) // "Hello World"
// For Unicode text, a legacy workaround (uses deprecated escape/unescape):
const encodeUTF8 = (str) => btoa(unescape(encodeURIComponent(str)))
const decodeUTF8 = (str) => decodeURIComponent(escape(atob(str)))
// Modern browsers support TextEncoder (better):
const encodeUTF8Modern = (str) => {
const bytes = new TextEncoder().encode(str)
return btoa(String.fromCharCode(...bytes))
}
// Data URI example (embed image in HTML)
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA..." />
// Node.js Base64
const base64 = Buffer.from("Hello World").toString('base64')
const text = Buffer.from(base64, 'base64').toString('utf8')
URL Encoding (Percent Encoding)
URLs can only contain certain "safe" ASCII characters. Special characters must be percent-encoded as %XX where XX is the hexadecimal byte value.
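Because the %XX escapes operate on bytes, a non-ASCII character becomes one escape per UTF-8 byte. A quick sketch:

```javascript
// Multi-byte UTF-8 characters become multiple %XX escapes
console.log(encodeURIComponent('é'))  // "%C3%A9" (2 UTF-8 bytes)
console.log(encodeURIComponent('🌍')) // "%F0%9F%8C%8D" (4 UTF-8 bytes)

// Decoding reassembles the bytes back into characters
console.log(decodeURIComponent('%C3%A9')) // "é"
```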
Safe vs Unsafe Characters
// Safe characters (no encoding needed):
A-Z a-z 0-9 - _ . ~
// Must be encoded:
Space → %20 (or + in query strings)
! → %21
" → %22
# → %23 (fragment identifier)
$ → %24
% → %25 (escape character itself)
& → %26 (query parameter separator)
' → %27
( → %28
) → %29
= → %3D (key-value separator)
? → %3F (query string start)
@ → %40
[ → %5B
] → %5D
{ → %7B
} → %7D
JavaScript URL Encoding Functions
// encodeURIComponent - Use for query parameters, form data
encodeURIComponent("Hello World!") → "Hello%20World%21"
encodeURIComponent("user@email.com") → "user%40email.com"
// Use case: Building query strings
const params = new URLSearchParams({
search: "Node.js & React",
category: "Web Development"
}).toString()
// → "search=Node.js+%26+React&category=Web+Development"
// encodeURI - Use for full URLs (preserves :, /, ?, #)
encodeURI("https://example.com/search?q=hello world")
// → "https://example.com/search?q=hello%20world"
// ❌ Never use escape() - Deprecated, doesn't handle Unicode properly
Common Encoding Scenarios in Web Development
1. Sending Form Data with Special Characters
// URL-encoded form submission (application/x-www-form-urlencoded)
const formData = {
username: "user@example.com",
bio: "Web developer & designer 🚀"
}
const encoded = new URLSearchParams(formData).toString()
// → "username=user%40example.com&bio=Web+developer+%26+designer+%F0%9F%9A%80"
// Sending to API
fetch('/api/profile', {
method: 'POST',
headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
body: encoded
})
2. Storing User-Generated Content
// Database: Store as UTF-8
// Display in HTML: Escape HTML entities
// Send in JSON: UTF-8 is valid in JSON strings
// Example: Blog comment system
// 1. Receive from user
const comment = req.body.comment
// 2. Store in database (UTF-8)
await db.query('INSERT INTO comments (text) VALUES (?)', [comment])
// 3. Retrieve from database
const rows = await db.query('SELECT text FROM comments')
// 4. Display in HTML (escape HTML entities)
const safeHtml = escapeHtml(rows[0].text)
res.send(`<div class="comment">${safeHtml}</div>`)
// 5. Send in JSON API (UTF-8 is fine)
res.json({ comment: rows[0].text })
3. Working with Filenames
// User uploads "résumé-2024.pdf"
// Store on disk: URL-encode or use safe filesystem names
const sanitizeFilename = (name) => {
return name
.normalize('NFD') // Decompose accented characters
.replace(/[\u0300-\u036f]/g, '') // Remove diacritics (combining marks)
.replace(/[^a-zA-Z0-9.-]/g, '_') // Replace unsafe chars
}
sanitizeFilename("résumé-2024.pdf") → "resume-2024.pdf"
// Or preserve Unicode with proper encoding
const safeFilename = encodeURIComponent(filename)
// Download header (Content-Disposition)
res.setHeader('Content-Disposition',
'attachment; filename="' + sanitizeFilename(name) + '"; ' +
'filename*=UTF-8\'\'' + encodeURIComponent(name))
// RFC 5987: filename* supports UTF-8 filenames
Debugging Encoding Issues
Tools and Techniques
// 1. Check byte representation
const text = "café"
console.log([...text].map(c => c.charCodeAt(0).toString(16)))
// → ['63', '61', '66', 'e9'] - Unicode code points for "café", not UTF-8 bytes
// For the actual UTF-8 bytes, use TextEncoder:
console.log([...new TextEncoder().encode(text)].map(b => b.toString(16)))
// → ['63', '61', '66', 'c3', 'a9']
// 2. Detect encoding (Node.js)
const chardet = require('chardet')
const encoding = chardet.detectFileSync('mystery.txt')
console.log(encoding) // e.g., "UTF-8" or "ISO-8859-1"
// 3. Convert encoding (Node.js)
const iconv = require('iconv-lite')
const latin1Buffer = Buffer.from([0x63, 0x61, 0x66, 0xE9]) // "café" in Latin-1
const utf8Text = iconv.decode(latin1Buffer, 'latin1')
console.log(utf8Text) // "café" (correctly decoded)
// 4. Check HTTP headers
// Browser DevTools → Network → Response Headers
// Look for: Content-Type: text/html; charset=UTF-8
Common Red Flags
- Seeing � (replacement character) = Encoding mismatch or corrupted data
- "café" instead of "café" = UTF-8 data read as Latin-1
- "caf茅" = Double-encoded UTF-8 (encoded UTF-8 re-encoded as UTF-8)
- Question marks in database but correct in application = DB charset wrong
Best Practices Checklist
- ✓ Use UTF-8 everywhere (HTML, database, files, APIs)
- ✓ Declare charset in HTML: <meta charset="UTF-8">
- ✓ Save source files as UTF-8 (not UTF-8 with BOM)
- ✓ Test with non-ASCII characters early (café, 日本語, emoji 🎉)
- ✓ MySQL: Use utf8mb4, not utf8
- ✓ PostgreSQL: Default UTF-8 encoding is good (check with SHOW SERVER_ENCODING)
- ✓ Set connection charset explicitly in code
- ✓ Use COLLATE utf8mb4_unicode_ci for case-insensitive sorting
- ✓ Use encodeURIComponent() for query parameters
- ✓ Use encodeURI() for full URLs
- ✓ Add accept-charset="UTF-8" to forms
- ✓ Validate and sanitize user input before display
Summary
- UTF-8 is the standard - Use it for everything unless you have a specific legacy requirement
- URL encode query parameters - Use encodeURIComponent() for user input in URLs
- HTML encode user content - Prevent XSS by escaping <, >, &, ", '
- Base64 for binary in text - When you need to embed images, files, or binary data in JSON/HTML/CSS
- Watch for mojibake - "cafÃ©" means encoding mismatch; fix database/connection charset
- MySQL: use utf8mb4 - Not "utf8" (which doesn't support emoji)
Proper character encoding prevents data corruption, display issues, and security vulnerabilities. When in doubt: UTF-8 everywhere, encode for context (URLs, HTML, Base64), and test with international characters early in development.