Text Diff — How File Comparison Algorithms Work
Learn how diff algorithms compare text files, understand unified diff format, and see how tools like git diff work under the hood.
Introduction
Every developer uses diff tools daily — git diff, pull request reviews, merge conflict resolution. But how does a diff algorithm figure out which lines were added, removed, or changed between two files? And what does that cryptic @@ -15,7 +15,8 @@ notation actually mean?
What Is a Diff?
A diff is a representation of the differences between two pieces of text. Given an "old" version and a "new" version, a diff algorithm outputs a minimal set of changes that transforms the old into the new.
The simplest example:
Old:
Hello
World
New:
Hello
Beautiful
World
Diff:
  Hello
+ Beautiful
  World
Lines prefixed with + were added. Lines prefixed with - were removed. Unchanged lines provide context.
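As a quick illustration, Python's standard difflib module can reproduce the example above. This is only a sketch of one library's output format (ndiff), not a description of how git computes its diffs; the variable names are made up for the example.

```python
import difflib

old = ["Hello", "World"]
new = ["Hello", "Beautiful", "World"]

# ndiff prefixes each line with "  " (unchanged), "+ " (added) or "- " (removed).
diff_lines = list(difflib.ndiff(old, new))

for line in diff_lines:
    print(line)
```

The inputs are lists of lines rather than raw strings, which mirrors the way line-based diff tools treat a file.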
The Longest Common Subsequence (LCS)
Most diff algorithms are based on finding the Longest Common Subsequence between two sequences. The LCS is the longest sequence of elements that appear in both inputs, in the same order, but not necessarily contiguously.
For text diffing, the "elements" are lines. Given:
Old: [A, B, C, D, E]
New: [A, C, D, F, E]
The LCS is [A, C, D, E]. Lines not in the LCS were either deleted from old (B) or added in new (F).
Finding the LCS is computationally expensive — a naive approach is $O(2^n)$. The classic dynamic programming solution is $O(n \times m)$ where n and m are the lengths of the two inputs.
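The dynamic programming approach can be sketched in a few lines of Python. This is a minimal textbook version, assuming the inputs are lists of lines; the function name lcs is just a label for this example, and real diff tools use more memory-efficient variants.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover the subsequence.
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # a[i] is not part of the LCS: it was deleted
        else:
            j += 1  # b[j] is not part of the LCS: it was added
    return result


print(lcs(list("ABCDE"), list("ACDFE")))  # ['A', 'C', 'D', 'E']
```

The nested loops fill an (n + 1) by (m + 1) table, which is where the $O(n \times m)$ cost comes from.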
The Myers Diff Algorithm
Eugene Myers published the foundational diff algorithm in 1986. Git uses a variant of this algorithm. The key insight: instead of finding the LCS directly, find the shortest edit script (SES), the shortest sequence of insertions and deletions that transforms one sequence into the other. The two problems are equivalent: the lines left untouched by a shortest edit script form an LCS.
Myers' algorithm works on an "edit graph" where:
- Moving right means deleting a line from the old file
- Moving down means inserting a line from the new file
- Moving diagonally means the lines match (no edit needed)
The algorithm finds the shortest path from top-left to bottom-right, which corresponds to the minimal diff. Its time complexity is $O((n + m) \times d)$, where n and m are the lengths of the two files and d is the length of the shortest edit script. For similar files (small d), this is much faster than $O(n \times m)$.
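To make the edit-graph search concrete, here is a rough sketch of the greedy forward search in Python. It only returns d, the length of the shortest edit script, not the script itself, and it is not Git's implementation; the function name shortest_edit_distance is invented for this example.

```python
def shortest_edit_distance(a, b):
    """Return the minimum number of insertions plus deletions turning a into b."""
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]      # move down: insert a line from the new file
            else:
                x = v[k - 1] + 1  # move right: delete a line from the old file
            y = x - k
            # Follow the diagonal for free while lines match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m


print(shortest_edit_distance(list("ABCDE"), list("ACDFE")))  # 2
```

The outer loop stops as soon as some path reaches the bottom-right corner, so nearly identical files finish after only a few rounds regardless of their size.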
Unified Diff Format
The most common diff output format. Here's a complete example:
--- a/config.json
+++ b/config.json
@@ -1,6 +1,7 @@
 {
   "name": "my-app",
-  "version": "1.0.0",
+  "version": "1.1.0",
+  "description": "A sample app",
   "main": "index.js",
   "scripts": {
     "start": "node index.js"
Breaking it down:
Header:
- --- a/config.json is the old file
- +++ b/config.json is the new file
Hunk header:
@@ -1,6 +1,7 @@ means:
- -1,6: starting at line 1 of the old file, showing 6 lines
- +1,7: starting at line 1 of the new file, showing 7 lines
Content:
- Lines starting with a space are unchanged (context)
- Lines starting with - exist only in the old file (deleted)
- Lines starting with + exist only in the new file (added)
- A line with both - and + versions represents a modification
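If you ever need to read hunk headers programmatically, a small regular expression is enough. The sketch below is illustrative: parse_hunk_header is a made-up helper, and it relies on the unified format's convention that an omitted line count means 1.

```python
import re

# Matches "@@ -start[,count] +start[,count] @@", optionally followed by more text.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count) for a hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,  # omitted count means 1
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


print(parse_hunk_header("@@ -1,6 +1,7 @@"))  # (1, 6, 1, 7)
```

Git often appends the enclosing function name after the second @@, which is why the pattern does not anchor at the end of the line.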
Context Lines
Diffs typically show 3 lines of unchanged context around each change. This serves two purposes:
- Human readability — context helps you understand where in the file the change occurs
- Patch application — when applying a diff as a patch, context lines help locate the correct position even if line numbers have shifted
You can control context with git diff -U<n> where n is the number of context lines. -U0 shows no context (just the changes), -U10 shows 10 lines of context.
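Python's difflib exposes the same idea through the n parameter of unified_diff, which makes the effect of context size easy to see. The file names and the ten-line sample below are invented for the illustration.

```python
import difflib

old = [f"line {i}" for i in range(1, 11)]
new = old.copy()
new[4] = "line five"  # change only the fifth line


def context_line_count(n):
    """Count the unchanged (space-prefixed) lines in a diff with n context lines."""
    diff = difflib.unified_diff(old, new, "a/file.txt", "b/file.txt", n=n, lineterm="")
    return sum(1 for line in diff if line.startswith(" "))


for n in (0, 3, 10):
    print(n, context_line_count(n))
```

With n=0 the hunk contains only the changed lines, while n=10 is enough to pull in every other line of this small file.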
Word-Level and Character-Level Diffs
Line-level diffs can be noisy when only a single word on a line changed:
- The quick brown fox jumps over the lazy dog.
+ The quick brown fox leaps over the lazy dog.
The entire line is marked as changed even though only one word differs. Word-level or character-level diffs highlight exactly what changed within the line:
The quick brown fox [-jumps-]{+leaps+} over the lazy dog.
git diff --word-diff provides this, and most visual diff tools (VS Code, GitHub) highlight intra-line differences automatically.
The text diff tool supports both line-level and highlighted inline differences.
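A word-level diff is the same sequence comparison applied to words instead of lines. The sketch below uses difflib.SequenceMatcher and borrows the [-removed-] and {+added+} markers from git's plain word-diff output; word_diff is a hypothetical helper, and splitting on whitespace is a simplification that discards the original spacing.

```python
import difflib


def word_diff(old, new):
    """Return a single line marking removed and added words."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(a[i1:i2])
            continue
        marked = ""
        if i1 < i2:  # words present only in the old text
            marked += "[-" + " ".join(a[i1:i2]) + "-]"
        if j1 < j2:  # words present only in the new text
            marked += "{+" + " ".join(b[j1:j2]) + "+}"
        pieces.append(marked)
    return " ".join(pieces)


print(word_diff(
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox leaps over the lazy dog.",
))
```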
Three-Way Merge
When two people edit the same file from a common ancestor, you need a three-way diff:
    Base (ancestor)
     /         \
  Ours        Theirs
     \         /
      Merged
The merge algorithm:
- Diff "base vs ours" to find our changes
- Diff "base vs theirs" to find their changes
- If changes are in different regions, apply both automatically
- If changes overlap (same lines modified differently), flag a merge conflict
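The steps above can be sketched as a toy line-based merge. This is a simplified illustration, not Git's merge machinery: it uses difflib.SequenceMatcher for the two diffs, treats edits that touch or overlap as one conflict region, and the names changes, apply_edits and three_way_merge are invented for the example.

```python
from difflib import SequenceMatcher


def changes(base, other):
    """Return (start, end, replacement) for each region of base that other changed."""
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2, other[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_edits(base, lo, hi, edits):
    """Rebuild base[lo:hi] with the given edits applied."""
    out, pos = [], lo
    for start, end, replacement in edits:
        out += base[pos:start] + replacement
        pos = end
    return out + base[pos:hi]


def three_way_merge(base, ours, theirs):
    ours_edits, their_edits = changes(base, ours), changes(base, theirs)
    merged, pos, i, j = [], 0, 0, 0
    while i < len(ours_edits) or j < len(their_edits):
        # Start a region at the earliest pending edit from either side.
        lo = min(edits[k][0]
                 for edits, k in ((ours_edits, i), (their_edits, j)) if k < len(edits))
        hi, mine, yours = lo, [], []
        grew = True
        while grew:  # absorb every edit that touches the current region
            grew = False
            if i < len(ours_edits) and ours_edits[i][0] <= hi:
                mine.append(ours_edits[i])
                hi = max(hi, ours_edits[i][1])
                i += 1
                grew = True
            if j < len(their_edits) and their_edits[j][0] <= hi:
                yours.append(their_edits[j])
                hi = max(hi, their_edits[j][1])
                j += 1
                grew = True
        our_text = apply_edits(base, lo, hi, mine)
        their_text = apply_edits(base, lo, hi, yours)
        if not yours or our_text == their_text:
            region = our_text    # only we changed it, or both made the same change
        elif not mine:
            region = their_text  # only they changed it
        else:
            region = ["<<<<<<< ours"] + our_text + ["======="] + their_text + [">>>>>>> theirs"]
        merged += base[pos:lo] + region
        pos = hi
    return merged + base[pos:]
```

Calling three_way_merge(base, ours, theirs) with three lists of lines returns the merged lines, with conflict markers wrapped around any region both sides changed differently.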
This is the core of what git merge does for each file, although Git layers more on top (such as rename detection). Conflicts are marked with:
<<<<<<< HEAD
our version of the line
=======
their version of the line
>>>>>>> feature-branch
Common Diff Use Cases
Code Review
Every pull request shows diffs. Reading diffs efficiently is a core development skill. Focus on:
- What changed (the + and - lines)
- Why it changed (the PR description and commit messages)
- What's around the change (context lines)
Configuration Changes
Diffing configuration files catches unintended changes. Before deploying a config update, diff it against production to verify only intended changes are present.
Data Validation
Comparing two data exports reveals discrepancies. JSON files can be diffed effectively if you first format them consistently (sorted keys, consistent indentation) — otherwise formatting differences create noise.
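Here is one way to do that normalization in Python before diffing. It is a sketch with made-up helper names (normalized_lines, json_diff), using only the standard json and difflib modules.

```python
import difflib
import json


def normalized_lines(text):
    """Re-serialize JSON with sorted keys and fixed indentation, split into lines."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2).splitlines()


def json_diff(old_text, new_text):
    """Return unified diff lines between two JSON documents after normalization."""
    return list(difflib.unified_diff(
        normalized_lines(old_text), normalized_lines(new_text),
        "old", "new", lineterm="",
    ))


# Same data, different key order and whitespace: no diff at all.
print(json_diff('{"b": 1, "a": 2}', '{\n "a": 2,\n "b": 1}'))  # []
```

Sorting keys matters as much as indentation: two exports of the same object can list their keys in different orders, and a line-based diff would report every one of them as changed.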
Debugging
"It worked yesterday, doesn't today." Diffing yesterday's code against today's immediately shows what changed, narrowing the debugging scope.
Performance Considerations
Diff algorithms slow down with:
- Very large files (>100MB): consider splitting or sampling
- Very different files (few common lines): the edit distance d approaches the combined length of the two files, so the Myers algorithm loses its advantage
- Binary files: line-based diff is meaningless; use specialized tools
When readability matters more than speed, the patience diff algorithm (git diff --patience) often produces cleaner output by anchoring on lines that appear exactly once in each file, at the cost of slightly longer computation.
Semantic vs Textual Diffs
Standard diffs are purely textual — they don't understand the language. A semantic diff understands that:
- function foo(a, b) {
+ function foo(b, a) {
...swapped function parameters, which may have widespread implications. Textual diff just sees two changed lines.
Tools like difftastic provide language-aware diffs built on parsed syntax trees, so they can highlight exactly which syntactic elements changed (here, the two parameters) rather than flagging whole lines.
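To give a feel for the idea, the sketch below compares parameter lists through Python's ast module instead of comparing text. It is a toy stand-in that uses Python functions rather than the JavaScript above, it is not how difftastic works internally, and describe_change and its messages are invented for the example.

```python
import ast


def param_names(source):
    """Return the positional parameter names of the first function in source."""
    func = ast.parse(source).body[0]
    return [arg.arg for arg in func.args.args]


def describe_change(old_source, new_source):
    """Classify how the parameter list differs, ignoring formatting."""
    old, new = param_names(old_source), param_names(new_source)
    if old == new:
        return "parameters unchanged"
    if sorted(old) == sorted(new):
        return "parameter order changed"
    return "parameters changed"


print(describe_change("def foo(a, b): pass", "def foo(b, a): pass"))
# parameter order changed
```

Because the comparison runs on the parsed tree, a pure reformatting such as def foo(a,b) versus def foo(a, b) is reported as unchanged, even though a textual diff would flag the line.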
Conclusion
Diff algorithms are one of those foundational tools that quietly power modern software development. Every git commit, every code review, every merge relies on efficiently computing the minimal set of changes between two texts.
Use the text diff tool for quick browser-based file comparisons. For structured data, the JSON formatter can normalize formatting before diffing. And when you need to verify file integrity rather than content differences, the hash generator can confirm whether two files are identical without examining every line.