Text Diff — How File Comparison Algorithms Work

Learn how diff algorithms compare text files, understand unified diff format, and see how tools like git diff work under the hood.

Andreas · April 16, 2026 · 8 min read

Introduction

Every developer uses diff tools daily — git diff, pull request reviews, merge conflict resolution. But how does a diff algorithm figure out which lines were added, removed, or changed between two files? And what does that cryptic @@ -15,7 +15,8 @@ notation actually mean?

What Is a Diff?

A diff is a representation of the differences between two pieces of text. Given an "old" version and a "new" version, a diff algorithm outputs a minimal set of changes that transforms the old into the new.

The simplest example:

Old:

Hello
World

New:

Hello
Beautiful
World

Diff:

  Hello
+ Beautiful
  World

Lines prefixed with + were added. Lines prefixed with - were removed. Unchanged lines provide context.
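To try this yourself, here is a minimal sketch using Python's standard difflib module. The helper name simple_diff is just an illustrative choice; difflib.ndiff happens to use the same two-character prefixes shown above.

```python
import difflib

def simple_diff(old_lines, new_lines):
    # ndiff prefixes each line with "  " (unchanged), "+ " (added) or "- " (removed)
    return list(difflib.ndiff(old_lines, new_lines))

old = ["Hello", "World"]
new = ["Hello", "Beautiful", "World"]

for line in simple_diff(old, new):
    print(line)
```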

The Longest Common Subsequence (LCS)

Most diff algorithms are based on finding the Longest Common Subsequence between two sequences. The LCS is the longest sequence of elements that appear in both inputs, in the same order, but not necessarily contiguously.

For text diffing, the "elements" are lines. Given:

Old: [A, B, C, D, E]
New: [A, C, D, F, E]

The LCS is [A, C, D, E]. Lines not in the LCS were either deleted from old (B) or added in new (F).

Finding the LCS is computationally expensive: naively enumerating every subsequence takes exponential time, $O(2^n)$. The classic dynamic programming solution takes $O(n \times m)$ time, where n and m are the lengths (in lines) of the two inputs.
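The dynamic programming approach can be sketched in a few lines of Python. This is a textbook-style illustration, not the code any particular diff tool ships; the function name lcs and the table layout are choices made for this example.

```python
def lcs(old, new):
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one LCS
    result = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            result.append(old[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping old[i] loses nothing: treat it as deleted
        else:
            j += 1  # skipping new[j] loses nothing: treat it as added
    return result

print(lcs(list("ABCDE"), list("ACDFE")))
```

Both nested loops touch every cell of an (n+1) by (m+1) table, which is where the $O(n \times m)$ cost comes from.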

The Myers Diff Algorithm

Eugene Myers published the foundational diff algorithm in 1986. Git uses a variant of this algorithm. The key insight: instead of finding the LCS directly, find the shortest edit script (SES) — the minimum number of insertions and deletions to transform one sequence into the other.

Myers' algorithm works on an "edit graph" where:

Moving right means deleting a line from the old file
Moving down means inserting a line from the new file
Moving diagonally means the lines match (no edit needed)

The algorithm finds the shortest path from top-left to bottom-right, which corresponds to the minimal diff. Its time complexity is $O(n \times d)$, where n here is the combined length of the two inputs and d is the size of the shortest edit script. For similar files (small d), this is much faster than $O(n \times m)$.
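Here is a compact sketch of the greedy search at the heart of Myers' approach. It only returns d, the length of the shortest edit script, and leaves out the bookkeeping needed to recover the actual edits. The function name and the dictionary used for the frontier are choices made for this example, and Git's real implementation is considerably more elaborate.

```python
def myers_distance(a, b):
    n, m = len(a), len(b)
    # v[k] = furthest x reached so far on diagonal k, where k = x - y
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: insert a line from the new file
            else:
                x = v[k - 1] + 1    # step right: delete a line from the old file
            y = x - k
            # follow the diagonal for free while lines match
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m

print(myers_distance(list("ABCDE"), list("ACDFE")))
```

The outer loop runs once per edit, so for two nearly identical files it stops after only a handful of iterations no matter how long the files are.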

Unified Diff Format

The most common diff output format. Here's a complete example:

--- a/config.json
+++ b/config.json
@@ -1,6 +1,7 @@
 {
   "name": "my-app",
-  "version": "1.0.0",
+  "version": "1.1.0",
+  "description": "A sample app",
   "main": "index.js",
   "scripts": {
     "start": "node index.js"

Breaking it down:

Header:

--- a/config.json — the old file
+++ b/config.json — the new file

Hunk header:

@@ -1,6 +1,7 @@ means:
- -1,6 — starting at line 1 of the old file, showing 6 lines
- +1,7 — starting at line 1 of the new file, showing 7 lines

Content:

Lines starting with a space are unchanged (context)
Lines starting with - exist only in the old file (deleted)
Lines starting with + exist only in the new file (added)
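A - line followed directly by a + line usually represents a modification: the format has no separate "changed" marker, so an edit appears as a deletion plus an addition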
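If you ever need to read hunk headers programmatically, a small regular expression is enough. This is a sketch: parse_hunk_header is a hypothetical helper name, and it relies on the unified format's convention that an omitted line count means 1.

```python
import re

# Matches "@@ -start[,count] +start[,count] @@", optionally followed by a section label
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(line):
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )

print(parse_hunk_header("@@ -1,6 +1,7 @@"))
```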

Context Lines

Diffs typically show 3 lines of unchanged context around each change. This serves two purposes:

Human readability — context helps you understand where in the file the change occurs
Patch application — when applying a diff as a patch, context lines help locate the correct position even if line numbers have shifted

You can control context with git diff -U<n> where n is the number of context lines. -U0 shows no context (just the changes), -U10 shows 10 lines of context.
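The same knob exists in Python's difflib, where the n parameter of unified_diff plays the role of -U. The sketch below uses made-up file names and line contents purely for illustration.

```python
import difflib

def diff_with_context(old_lines, new_lines, context):
    # n is the number of context lines around each change (like git diff -U<n>)
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="a/file.txt", tofile="b/file.txt",
        n=context, lineterm="",
    ))

old = [f"line{i}" for i in range(1, 10)]
new = list(old)
new[4] = "LINE5"  # change only the fifth line

for context in (0, 1, 3):
    print(f"--- context = {context} ---")
    print("\n".join(diff_with_context(old, new, context)))
```

With more context the hunk grows and its header changes accordingly, even though the actual edit is the same single line.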

Word-Level and Character-Level Diffs

Line-level diffs can be noisy when only a single word on a line changed:

-  The quick brown fox jumps over the lazy dog.
+  The quick brown fox leaps over the lazy dog.

The entire line is marked as changed even though only one word differs. Word-level or character-level diffs highlight exactly what changed within the line:

  The quick brown fox [-jumps-]{+leaps+} over the lazy dog.

git diff --word-diff provides this, and most visual diff tools (VS Code, GitHub) highlight intra-line differences automatically.

The text diff tool supports both line-level and highlighted inline differences.
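A word-level diff is the same idea applied to a different unit: split each line into words and diff the word lists. Here is a rough sketch using difflib.SequenceMatcher. The word_diff helper is hypothetical, it splits on whitespace only, and it borrows the [-removed-]{+added+} markers that git's plain word-diff mode prints.

```python
import difflib

def word_diff(old_line, new_line):
    old_words, new_words = old_line.split(), new_line.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        piece = ""
        if tag in ("delete", "replace"):
            piece += "[-" + " ".join(old_words[i1:i2]) + "-]"
        if tag in ("insert", "replace"):
            piece += "{+" + " ".join(new_words[j1:j2]) + "+}"
        out.append(piece)
    return " ".join(out)

print(word_diff(
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox leaps over the lazy dog.",
))
```

Because it rejoins words with single spaces, this sketch does not preserve the original whitespace; real tools track that separately.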

Three-Way Merge

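When two people edit the same file starting from a common ancestor, a plain two-way diff is not enough. You need a three-way merge, which compares both edited versions against that ancestor: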

    Base (ancestor)
       /    \
    Ours    Theirs
       \    /
     Merged

The merge algorithm:

Diff "base vs ours" to find our changes
Diff "base vs theirs" to find their changes
If changes are in different regions, apply both automatically
If changes overlap (same lines modified differently), flag a merge conflict

This is essentially what git merge does for each file that both sides touched. Conflicts are marked with:

<<<<<<< HEAD
our version of the line
=======
their version of the line
>>>>>>> feature-branch
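The steps above can be sketched as a simplified line-based merge. This is a toy illustration of the idea and not Git's merge machinery: merge3 and _match_map are hypothetical names, difflib stands in for the diff step, and the conflict labels "ours" and "theirs" are placeholders.

```python
import difflib

def _match_map(base, other):
    # Map each base line index to its index in `other` for lines the diff left unchanged
    mapping = {}
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            mapping[a + k] = b + k
    return mapping

def merge3(base, ours, theirs):
    in_ours = _match_map(base, ours)      # diff "base vs ours"
    in_theirs = _match_map(base, theirs)  # diff "base vs theirs"
    # Sync points: base lines that both sides kept, plus an end-of-file sentinel
    sync = [(i, in_ours[i], in_theirs[i])
            for i in range(len(base)) if i in in_ours and i in in_theirs]
    sync.append((len(base), len(ours), len(theirs)))

    merged, conflict = [], False
    b = o = t = 0  # start of the current chunk in base, ours, theirs
    for bi, oi, ti in sync:
        chunk_base, chunk_ours, chunk_theirs = base[b:bi], ours[o:oi], theirs[t:ti]
        if chunk_ours == chunk_theirs:
            merged += chunk_ours        # untouched, or both made the same change
        elif chunk_ours == chunk_base:
            merged += chunk_theirs      # only they changed this region
        elif chunk_theirs == chunk_base:
            merged += chunk_ours        # only we changed this region
        else:
            conflict = True             # both changed it differently
            merged += ["<<<<<<< ours", *chunk_ours, "=======",
                       *chunk_theirs, ">>>>>>> theirs"]
        if bi < len(base):
            merged.append(base[bi])     # the shared line itself
        b, o, t = bi + 1, oi + 1, ti + 1
    return merged, conflict

base = ["a", "b", "c", "d", "e"]
print(merge3(base, ["a", "B", "c", "d", "e"], ["a", "b", "c", "d", "E"]))
```

Changes in different regions merge cleanly, while two different edits to the same base line produce the familiar conflict block.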

Common Diff Use Cases

Code Review

Every pull request shows diffs. Reading diffs efficiently is a core development skill. Focus on:

What changed (the + and - lines)
Why it changed (the PR description and commit messages)
What's around the change (context lines)

Configuration Changes

Diffing configuration files catches unintended changes. Before deploying a config update, diff it against production to verify only intended changes are present.

Data Validation

Comparing two data exports reveals discrepancies. JSON files can be diffed effectively if you first format them consistently (sorted keys, consistent indentation) — otherwise formatting differences create noise.
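A minimal sketch of that normalization step in Python, using only the standard json module (the helper name normalize_json is just for illustration):

```python
import json

def normalize_json(text):
    # Sorted keys and fixed indentation make formatting differences disappear
    return json.dumps(json.loads(text), sort_keys=True, indent=2)

export_a = '{"b": 1, "a": 2}'
export_b = '{\n    "a": 2,\n    "b": 1\n}'

# Different on disk, identical once normalized
print(normalize_json(export_a) == normalize_json(export_b))
```

Run both files through the same normalizer first, then diff the results; whatever remains is a real difference in the data.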

Debugging

"It worked yesterday, doesn't today." Diffing yesterday's code against today's immediately shows what changed, narrowing the debugging scope.

Performance Considerations

Diff algorithms slow down with:

Very large files (>100MB) — consider splitting or sampling
Very different files (few common lines) — the edit distance d approaches n, losing the Myers algorithm's advantage
Binary files — line-based diff is meaningless; use specialized tools

When readability matters more than raw speed, the patience diff algorithm (available via git diff --patience) often produces cleaner output by anchoring on lines that appear exactly once in both files, sometimes at the cost of longer computation.
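The anchoring step can be sketched as follows: collect the lines that occur exactly once in each file, then keep the longest run of them that appears in the same order on both sides. This is an illustration of the idea only; patience_anchors is a hypothetical name, and a full patience diff would go on to diff the gaps between anchors recursively.

```python
from bisect import bisect_left
from collections import Counter

def patience_anchors(old, new):
    old_counts, new_counts = Counter(old), Counter(new)
    new_pos = {line: j for j, line in enumerate(new) if new_counts[line] == 1}
    # (old index, new index) for lines unique in both files, in old-file order
    pairs = [(i, new_pos[line]) for i, line in enumerate(old)
             if old_counts[line] == 1 and line in new_pos]

    # Longest increasing subsequence of new indices, found by patience sorting
    tails, tails_idx = [], []       # smallest tail per length, and its index in pairs
    prev = [None] * len(pairs)      # back-pointers for reconstruction
    for p, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tails_idx.append(p)
        else:
            tails[k] = j
            tails_idx[k] = p
        prev[p] = tails_idx[k - 1] if k > 0 else None

    anchors = []
    p = tails_idx[-1] if tails_idx else None
    while p is not None:
        anchors.append(pairs[p])
        p = prev[p]
    return anchors[::-1]

old = ["a", "x", "b", "x", "c"]
new = ["b", "a", "x", "c", "x"]
print(patience_anchors(old, new))
```

Repeated lines such as closing braces or blank lines never become anchors, which is why the resulting diffs tend to line up on meaningful content.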

Semantic vs Textual Diffs

Standard diffs are purely textual — they don't understand the language. A semantic diff understands that:

- function foo(a, b) {
+ function foo(b, a) {

...swaps the function's parameters, which may have widespread implications. A textual diff just sees one removed line and one added line.

Tools like difftastic provide language-aware diffs based on the parsed syntax tree rather than raw lines, so they can highlight the swapped parameters themselves instead of flagging the whole line as changed.
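To get a feel for the difference, here is a small sketch that compares Python source by syntax tree instead of by text, using the standard ast module. It is only a yes/no structural check, far simpler than what a tool like difftastic does, and it uses Python rather than the JavaScript-style snippet above; same_structure is a hypothetical helper name.

```python
import ast

def same_structure(src_a, src_b):
    # ast.dump omits line numbers by default, so layout and comments are ignored
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))

original = "def foo(a, b):\n    return a - b\n"
reformatted = "def foo(a,b):\n    # subtract\n    return a-b\n"
swapped = "def foo(b, a):\n    return a - b\n"

print(same_structure(original, reformatted))  # formatting-only change
print(same_structure(original, swapped))      # parameter order changed
```

A textual diff would flag both variants as changed lines; the tree comparison treats the reformatted version as equivalent and only the swapped parameters as a real change.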

Conclusion

Diff algorithms are one of those foundational tools that quietly power modern software development. Every commit you inspect, every code review, every merge relies on efficiently computing a compact set of changes between two texts.

Use the text diff tool for quick browser-based file comparisons. For structured data, the JSON formatter can normalize formatting before diffing. And when you need to verify file integrity rather than content differences, the hash generator can confirm whether two files are identical without examining every line.
