Cache-Optimized Matrix Multiplication

Concept

Tiled (blocked) matrix multiplication reorganizes the computation to process small sub-blocks that fit in the CPU cache, reducing memory traffic and improving performance.

Why This Matters

Naive matrix multiplication is memory-bound: the CPU spends most of its time waiting for data from RAM, not doing arithmetic. On modern hardware, a read from RAM costs roughly 50× as many cycles as a read from the L1 cache (see the latency table below). Tiling keeps data in cache longer, moving the operation from memory-bound toward compute-bound. High-performance BLAS implementations such as OpenBLAS and Intel MKL all use some form of tiling. Understanding tiling is the first step toward understanding how real numerical software extracts maximum performance from hardware.

Background: Why Naive Multiplication Is Slow

The naive algorithm from ch03:

for i in 0..m {
    for j in 0..p {
        let mut sum = 0.0;
        for k in 0..n {
            sum += self.get(i, k) * other.get(k, j);
        }
        data.push(sum);
    }
}

Look at the inner loop: for each $C_{ij}$, we walk along row $i$ of $\mathbf{A}$ (sequential in memory, which is good) and down column $j$ of $\mathbf{B}$ (a stride of other.cols elements, which is bad).

Matrix $\mathbf{B}$ is stored in row-major order. Accessing $\mathbf{B}_{kj}$ for successive $k$ means jumping by other.cols elements each time. For a large matrix, nearly every such access misses the cache and must fetch from RAM. Over the entire multiplication, every element of $\mathbf{B}$ is loaded from RAM $m$ times, once for each row of $\mathbf{A}$.

The same is true for $\mathbf{A}$ if we swap the loop order. The core problem: the naive algorithm cannot reuse data — each element of $\mathbf{A}$ and $\mathbf{B}$ is loaded from memory many times.
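To make the two access patterns concrete, the sketch below lists the flat indices that the inner k loop reads for one output element. The helper name inner_loop_indices is made up for this illustration, and it assumes both matrices are stored as flat row-major arrays, as in ch03.

```rust
/// Flat row-major indices read by the naive inner loop for one C[i][j].
/// `n` is the shared dimension (A's column count), `p` is B's column count.
/// Hypothetical helper, for illustration only.
fn inner_loop_indices(i: usize, j: usize, n: usize, p: usize) -> (Vec<usize>, Vec<usize>) {
    // A[i][k] for k = 0..n: consecutive addresses
    let a_indices = (0..n).map(|k| i * n + k).collect();
    // B[k][j] for k = 0..n: each step jumps by p elements
    let b_indices = (0..n).map(|k| k * p + j).collect();
    (a_indices, b_indices)
}
```

For example, inner_loop_indices(0, 1, 4, 1000) gives the A indices [0, 1, 2, 3] and the B indices [1, 1001, 2001, 3001]: the A reads are neighbours in memory, while consecutive B reads are 1000 elements apart.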

Intuition

The cache hierarchy

A modern CPU has multiple levels of cache:

Level	Typical size	Latency (cycles)	Relative to L1
L1	32 KB	~4	1×
L2	256 KB	~12	3×
L3	8–32 MB	~40	10×
RAM	8+ GB	~200	50×

If data fits in L1, the CPU can access it in ~4 cycles. If it has to go to RAM, that's ~200 cycles — 50× slower.

Tiling

Instead of multiplying entire matrices at once, we break them into tiles (blocks) small enough that two tiles — one from $\mathbf{A}$ and one from $\mathbf{B}$ — fit in the CPU cache together. The outer loops iterate over tile positions, and the inner loops perform a standard matrix multiplication on each tile pair.

The loop order is i → j → k: for each row of tiles, for each column of tiles, accumulate over the shared (summation) dimension.

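This is the same formula as naive multiplication: we've just reordered the multiply-add operations so that they are grouped by tile. Each $C_{ij}$ still receives exactly the same products (in the implementation below, even in the same order of increasing $k$), so the result is mathematically identical; computationally it is typically much faster on large matrices.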
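Written out for a single entry, the regrouping looks like this. It is a sketch that assumes a block size $b$ and $T = \lceil n / b \rceil$ tiles along the shared dimension $n$:

```latex
C_{ij} = \sum_{k=0}^{n-1} A_{ik} B_{kj}
       = \sum_{t=0}^{T-1} \; \sum_{k=tb}^{\min((t+1)b,\,n)-1} A_{ik} B_{kj}
```

Every product $A_{ik} B_{kj}$ appears exactly once on both sides; the inner sum collects the products that belong to tile $t$.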

Complete trace: 2×2 with block_size = 1

Let's trace every operation with concrete values to see exactly how tiles build up the result.

\mathbf{A} = \begin{pmatrix}1 & 2 \\ 3 & 4\end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix}5 & 6 \\ 7 & 8\end{pmatrix}, \qquad \mathbf{C} = \mathbf{A} \times \mathbf{B}

Expected result:

\mathbf{C} = \begin{pmatrix} 1\cdot5 + 2\cdot7 = 19 & 1\cdot6 + 2\cdot8 = 22 \\[4pt] 3\cdot5 + 4\cdot7 = 43 & 3\cdot6 + 4\cdot8 = 50 \end{pmatrix}

With block_size = 1, each tile is a single element. The loops (i, j, k with step_by(1)) produce this sequence:

i=0, j=0 — computing C[0,0]:

k	A element	B element	Operation	C[0,0] after
0	A[0,0] = 1	B[0,0] = 5	C[0,0] += 1×5	5
1	A[0,1] = 2	B[1,0] = 7	C[0,0] += 2×7	19

i=0, j=1 — computing C[0,1]:

k	A element	B element	Operation	C[0,1] after
0	A[0,0] = 1	B[0,1] = 6	C[0,1] += 1×6	6
1	A[0,1] = 2	B[1,1] = 8	C[0,1] += 2×8	22

i=1, j=0 — computing C[1,0]:

k	A element	B element	Operation	C[1,0] after
0	A[1,0] = 3	B[0,0] = 5	C[1,0] += 3×5	15
1	A[1,1] = 4	B[1,0] = 7	C[1,0] += 4×7	43

i=1, j=1 — computing C[1,1]:

k	A element	B element	Operation	C[1,1] after
0	A[1,0] = 3	B[0,1] = 6	C[1,1] += 3×6	18
1	A[1,1] = 4	B[1,1] = 8	C[1,1] += 4×8	50

Each C element is the sum of two products, accumulated across two k iterations. The $k$ loop is the critical link: it pairs elements from column $k$ of $\mathbf{A}$ with elements from row $k$ of $\mathbf{B}$ — exactly what matrix multiplication requires.
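The trace above can be replayed mechanically. The following sketch hard-codes the two example matrices as flat row-major arrays and logs the running value of each C element after every update; the function name trace_2x2 is made up for this illustration.

```rust
/// Replays the 2×2 trace with block_size = 1.
/// Returns the final C (row-major) and a log of
/// (i, j, k, value of C[i][j] after the update).
fn trace_2x2() -> ([f64; 4], Vec<(usize, usize, usize, f64)>) {
    let a = [1.0, 2.0, 3.0, 4.0]; // A, row-major
    let b = [5.0, 6.0, 7.0, 8.0]; // B, row-major
    let mut c = [0.0; 4];
    let mut log = Vec::new();
    let n = 2;
    // Tile loops with a step of 1: every tile is a single element.
    for i in (0..n).step_by(1) {
        for j in (0..n).step_by(1) {
            for k in (0..n).step_by(1) {
                c[i * n + j] += a[i * n + k] * b[k * n + j];
                log.push((i, j, k, c[i * n + j]));
            }
        }
    }
    (c, log)
}
```

The log holds eight entries, one per table row above, and the final array is [19.0, 22.0, 43.0, 50.0].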

Cache behavior: 64×64 with 32×32 tiles

With a realistic block size, the key question is: what stays in cache and what gets reloaded? Let's trace a 64×64 multiplication with block_size = 32. Each matrix splits into four 32×32 tiles:

\mathbf{A} = \begin{pmatrix}\mathbf{A}_{00} & \mathbf{A}_{01} \\ \mathbf{A}_{10} & \mathbf{A}_{11}\end{pmatrix},\quad \mathbf{B} = \begin{pmatrix}\mathbf{B}_{00} & \mathbf{B}_{01} \\ \mathbf{B}_{10} & \mathbf{B}_{11}\end{pmatrix},\quad \mathbf{C} = \begin{pmatrix}\mathbf{C}_{00} & \mathbf{C}_{01} \\ \mathbf{C}_{10} & \mathbf{C}_{11}\end{pmatrix}

Each 32×32 tile is $32 \times 32 \times 8 = 8192$ bytes, or 8 KB. Two input tiles plus the output tile (24 KB in total) fit in a 32 KB L1 cache. The tile-level execution:

i=0:
  j=0:
    k=0:  A₀₀ × B₀₀  →  C₀₀  (A₀₀, B₀₀ in L1; C₀₀ accumulates)
    k=1:  A₀₁ × B₁₀  →  C₀₀  (A₀₁, B₁₀ loaded fresh; C₀₀ stays; A₀₀, B₀₀ evicted)
  j=1:
    k=0:  A₀₀ × B₀₁  →  C₀₁  (A₀₀, B₀₁ loaded fresh; C₀₁ accumulates)
    k=1:  A₀₁ × B₁₁  →  C₀₁  (A₀₁, B₁₁ loaded fresh; C₀₁ stays)
i=1:
  j=0:
    k=0:  A₁₀ × B₀₀  →  C₁₀
    k=1:  A₁₁ × B₁₀  →  C₁₀
  j=1:
    k=0:  A₁₀ × B₀₁  →  C₁₁
    k=1:  A₁₁ × B₁₁  →  C₁₁

Key observations:

Each tile of $\mathbf{A}$ and $\mathbf{B}$ is loaded $N/b = 2$ times from RAM, where $N = 64$ is the matrix dimension and $b = 32$ is the block size. For example, $\mathbf{A}_{00}$ is loaded for j=0, evicted, then reloaded for j=1.
Each $\mathbf{C}$ tile stays in cache across all $k$ iterations for its (i,j) position, accumulating contributions.
Partial $\mathbf{C}$ tiles are not expected to be evicted mid-accumulation, because the two input tiles cycling through (16 KB) and the output tile (8 KB) together fit within a 32 KB L1.
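The first observation can be checked by walking the tile schedule and counting how often each tile is requested. The sketch below uses the same simplified model as the trace (every step fetches its A and B tile afresh); tile_load_counts is a hypothetical helper, not part of the Matrix type.

```rust
use std::collections::HashMap;

/// Counts how often each tile of A and B is requested by the i → j → k tile schedule.
/// Simplified model: every (i, j, k) step fetches its A tile and its B tile afresh.
/// `tiles` is the number of tiles per dimension (N / block_size).
fn tile_load_counts(
    tiles: usize,
) -> (HashMap<(usize, usize), usize>, HashMap<(usize, usize), usize>) {
    let mut a_loads = HashMap::new();
    let mut b_loads = HashMap::new();
    for i in 0..tiles {
        for j in 0..tiles {
            for k in 0..tiles {
                *a_loads.entry((i, k)).or_insert(0) += 1; // this step reads tile A_{ik}
                *b_loads.entry((k, j)).or_insert(0) += 1; // this step reads tile B_{kj}
            }
        }
    }
    (a_loads, b_loads)
}
```

With tiles = 2 (the 64×64 case with 32×32 tiles), every one of the four A tiles and four B tiles ends up with a count of 2.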

Comparison with the naive algorithm:

Algorithm	Times each element of B is loaded	Total B data loaded
Naive	64 times (once per A row)	2,048 KB
Tiled (32×32)	2 times	64 KB

The tiled algorithm loads 32× less data from RAM. For a general $N \times N$ matrix with block size $b$:

Naive: each element of $\mathbf{B}$ is loaded $N$ times
Tiled: each element of $\mathbf{B}$ is loaded $N/b$ times
Reduction in memory traffic: a factor of about $b$
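The general rule is easy to turn into a small calculator. This is a sketch under the same simplified model (square matrices of f64, a block size that divides the dimension, no reuse beyond what the model states); the name b_traffic_bytes is made up for the example.

```rust
/// Bytes of B fetched from RAM under the simplified model above.
/// Returns (naive, tiled). Assumes an n×n matrix of f64 and that `block` divides `n`.
fn b_traffic_bytes(n: usize, block: usize) -> (usize, usize) {
    let matrix_bytes = n * n * std::mem::size_of::<f64>();
    let naive = matrix_bytes * n; // every element loaded n times
    let tiled = matrix_bytes * (n / block); // every element loaded n / block times
    (naive, tiled)
}
```

For n = 64 and block = 32 the two values differ by a factor of 32, which is the block size, as the rule predicts.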

Why tiling works: the hardware does the caching

The multiply_tiled function contains no instructions like load_into_cache(tile); it simply reads and writes memory addresses. The CPU's cache hardware automatically keeps recently used cache lines in L1, and its prefetcher detects sequential access patterns and fetches upcoming lines ahead of time.

The tiled loop creates spatial locality (the jj loop reads consecutive f64 values, and each cache miss brings in a full 64-byte cache line covering the next several elements) and temporal locality (the $\mathbf{B}$ tile stays resident while ii advances, so the rows fetched during the first ii pass are still in L1 for every later pass).

The kk loop inside each tile, for the first two values of ii:

ii=i,   kk=0: load A[i][0]    →  jj loop over B[0][0..BS]  (cold: cache miss)
ii=i,   kk=1: load A[i][1]    →  jj loop over B[1][0..BS]  (B row 1 cold, row 0 still hot)
ii=i,   kk=2: load A[i][2]    →  jj loop over B[2][0..BS]  (B row 2 cold, rows 0-1 hot)
ii=i+1, kk=0: load A[i+1][0]  →  jj loop over B[0][0..BS]  (hot: loaded during the previous ii pass)

Without tiling, each advancement of k would stride down a column of a large $\mathbf{B}$, and the cache lines it touches would be evicted before they are reused. With tiling, the $\mathbf{B}$ tile is small enough to stay resident across all kk and ii iterations of the tile, so each of its rows is fetched from RAM once per tile multiplication and then reused from L1.
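The spatial-locality half of this argument comes down to cache-line arithmetic. The sketch below counts the distinct cache lines touched by a strided walk; it assumes 64-byte lines and an array whose first element sits at the start of a line, and the function name is invented for the example.

```rust
use std::collections::HashSet;

/// Number of distinct cache lines touched when reading `count` f64 values,
/// starting at flat index `start` and advancing by `stride` elements.
/// Assumes 64-byte lines and a line-aligned array start.
fn cache_lines_touched(start: usize, stride: usize, count: usize) -> usize {
    const F64_PER_LINE: usize = 64 / std::mem::size_of::<f64>(); // 8 values per line
    (0..count)
        .map(|t| (start + t * stride) / F64_PER_LINE)
        .collect::<HashSet<usize>>()
        .len()
}
```

Reading one 32-element tile row (stride 1) touches 4 lines, so 28 of the 32 reads hit a line that is already present. Walking 32 elements down a column of a 64-column matrix (stride 64) touches 32 different lines, one per read.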

The block_size is a design parameter, not a runtime command. The formula $2 \times b^2 \times 8 \text{ bytes} \leq \text{cache size}$ answers: "what block size makes the hardware's automatic caching work well?" Pick a size that satisfies it, and the cache controller naturally keeps the working set in L1. Pick one that violates it (e.g., $b = 256 \rightarrow$ two tiles = 1 MB ≫ L1), and the speedup vanishes.

Mathematical Notation

Tiled multiplication computes the same result but groups the summation into blocks:

\mathbf{C}_{ij} = \sum_{t=0}^{T-1} \mathbf{A}_{i,t} \mathbf{B}_{t,j}

where:

$\mathbf{C}_{ij}$ is a tile of $\mathbf{C}$ of size $b \times b$ (starting at row $i \cdot b$, column $j \cdot b$)
$\mathbf{A}_{i,t}$ is a tile of $\mathbf{A}$ of size $b \times b$
$\mathbf{B}_{t,j}$ is a tile of $\mathbf{B}$ of size $b \times b$
$T$ is the number of tiles along the shared dimension
$b$ is the block size

The block size $b$ is chosen so that two tiles (one from $\mathbf{A}$, one from $\mathbf{B}$) fit in the L1 or L2 cache simultaneously (the output tile stays in cache):

2 \times b^2 \times 8 \text{ bytes} \leq \text{cache size}

For a 32 KB L1 cache (8 bytes per f64):

b \leq \sqrt{\frac{32768}{2 \times 8}} \approx 45

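Common block sizes are powers of two: 16, 32, 64. We use 32 as a portable default: it satisfies the bound above, and even when the output tile is counted as a third resident tile ($3 \times 32^2 \times 8$ bytes = 24 KB) the working set fits in a 32 KB L1. A block size of 64 exceeds the L1 bound (two tiles already take 64 KB) and targets the L2 cache instead.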
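The bound can be evaluated for other cache sizes with a few lines of integer arithmetic. The helper below is a sketch; max_block_size and its parameters are names chosen for this example, and `tiles` lets you decide whether to count only the two input tiles or the output tile as well.

```rust
/// Largest block size b such that `tiles * b * b * elem_bytes <= cache_bytes`.
/// Hypothetical helper; `tiles` is the number of b×b tiles that must be resident at once.
fn max_block_size(cache_bytes: usize, tiles: usize, elem_bytes: usize) -> usize {
    let mut b = 0;
    // Grow b while the next larger block would still fit.
    while tiles * (b + 1) * (b + 1) * elem_bytes <= cache_bytes {
        b += 1;
    }
    b
}
```

max_block_size(32 * 1024, 2, 8) returns 45, matching the calculation above. Counting the output tile as a third resident tile gives 36, so a block size of 32 fits either way.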

A note on registers vs. cache. A $32 \times 32$ tile of f64 values is 8 KB — far too large for a CPU's register file (typically 16–32 registers per core, each 8 bytes). The tile lives in the L1 cache, not in registers. The compiler may keep a handful of individual $C_{ij}$ accumulators in registers through scalar replacement (hoisting repeated memory operations into registers), but the tile as a whole resides in cache. True register-level tiling — partitioning into micro-tiles small enough for registers — is a more advanced technique used in libraries like BLIS and OpenBLAS.
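As a rough illustration of what keeping accumulators in registers means, the sketch below computes one 2×2 block of the output with four local variables. It is a toy micro-kernel written for this note, not the code used by BLIS or OpenBLAS, and whether the locals really end up in registers is up to the compiler.

```rust
/// Computes the 2×2 block of C whose top-left corner is (i, j),
/// using four scalar accumulators instead of writing to C in the loop.
/// `a` is m×n and `b` is n×p, both flat row-major; the caller must ensure
/// that rows i and i+1 of A and columns j and j+1 of B exist.
fn micro_tile_2x2(a: &[f64], b: &[f64], n: usize, p: usize, i: usize, j: usize) -> [f64; 4] {
    let (mut c00, mut c01, mut c10, mut c11) = (0.0, 0.0, 0.0, 0.0);
    for k in 0..n {
        // Two values from A and two from B feed four multiply-adds.
        let (a0, a1) = (a[i * n + k], a[(i + 1) * n + k]);
        let (b0, b1) = (b[k * p + j], b[k * p + j + 1]);
        c00 += a0 * b0;
        c01 += a0 * b1;
        c10 += a1 * b0;
        c11 += a1 * b1;
    }
    [c00, c01, c10, c11]
}
```

Each iteration loads four values and performs four multiply-adds, whereas the plain inner loop loads two values per multiply-add. That reuse is the point of register-level tiling.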

Rust Implementation

Add a new crate to your workspace:

cd code && cargo new --lib --edition 2024 ch04-cache-optimized-matmul

This crate builds on the Matrix struct from ch03. Open code/ch04-cache-optimized-matmul/src/lib.rs and start by copying the full Matrix implementation from ch03-linear-algebra-matrices/src/lib.rs (the struct definition, all impl methods, and the tests). Then add the following methods within the impl Matrix block:

/// Tiled (blocked) matrix multiplication.
///
/// Processes the matrix in blocks of size `block_size` that fit in the CPU cache,
/// reducing memory traffic compared to the naive algorithm.
///
/// C_{ij} = Σ_{t} A_{i,t} B_{t,j}  (tiled formulation)
///
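/// # Panics
/// Panics if self.cols != other.rows (incompatible dimensions) or if block_size is 0.
pub fn multiply_tiled(&self, other: &Matrix, block_size: usize) -> Matrix {
    assert_eq!(
        self.cols, other.rows,
        "Incompatible dimensions: {}×{} vs {}×{}",
        self.rows, self.cols, other.rows, other.cols
    );
    // step_by would panic on a zero step anyway; fail early with a clearer message.
    assert!(block_size > 0, "block_size must be greater than zero");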

    let m = self.rows;
    let n = self.cols;
    let p = other.cols;

    // Initialize result matrix with zeros
    let mut c = Matrix::new(vec![0.0; m * p], m, p);

    for i in (0..m).step_by(block_size) {
        let i_end = (i + block_size).min(m);
        for j in (0..p).step_by(block_size) {
            let j_end = (j + block_size).min(p);
            for k in (0..n).step_by(block_size) {
                let k_end = (k + block_size).min(n);

                // Multiply the current tile: A[i..i_end][k..k_end] × B[k..k_end][j..j_end]
                for ii in i..i_end {
                    for kk in k..k_end {
                        let a_ik = self.data[ii * n + kk];
                        for jj in j..j_end {
                            c.data[ii * p + jj] += a_ik * other.data[kk * p + jj];
                        }
                    }
                }
            }
        }
    }

    c
}

The inner loop order matters: i, k, j (not i, j, k). Here's why:

The kk loop loads one element from $\mathbf{A}$ (a_ik), then iterates over columns of $\mathbf{B}$ (jj). Within the jj loop, $\mathbf{B}$ is accessed sequentially (row-major) — cache-friendly.
The result $\mathbf{C}$ is accumulated in-place, so the output tile stays in cache.
By processing a tile of $\mathbf{A}$ with a tile of $\mathbf{B}$, both tiles fit in cache together.
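To isolate the effect of loop order from tiling, here is a sketch of both orders as free functions on flat row-major slices. The function names are made up for this comparison and are independent of the Matrix type.

```rust
/// i → j → k order: the innermost loop walks `b` with a stride of `p` elements.
/// `a` is m×n and `b` is n×p, both flat row-major.
fn matmul_ijk(a: &[f64], b: &[f64], m: usize, n: usize, p: usize) -> Vec<f64> {
    let mut c = vec![0.0; m * p];
    for i in 0..m {
        for j in 0..p {
            for k in 0..n {
                c[i * p + j] += a[i * n + k] * b[k * p + j];
            }
        }
    }
    c
}

/// i → k → j order: the innermost loop walks a row of `b` and a row of `c` sequentially.
fn matmul_ikj(a: &[f64], b: &[f64], m: usize, n: usize, p: usize) -> Vec<f64> {
    let mut c = vec![0.0; m * p];
    for i in 0..m {
        for k in 0..n {
            let a_ik = a[i * n + k]; // read once, reused for the whole row of b
            for j in 0..p {
                c[i * p + j] += a_ik * b[k * p + j];
            }
        }
    }
    c
}
```

Both functions add the same products to each output element in the same order of increasing k, so they return identical results. Only the memory access pattern differs.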

Benchmarking

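Add a benchmark function to the same impl Matrix block to compare naive and tiled multiplication. It takes no self parameter, so it is called as Matrix::benchmark_multiply: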

/// Compare naive and tiled multiplication for correctness and performance.
pub fn benchmark_multiply(size: usize, block_size: usize) {
    use std::time::Instant;

    // Create two deterministic test matrices (fixed values, so runs are reproducible)
    let a_data: Vec<f64> = (0..size * size).map(|i| (i as f64 + 1.0) * 0.5).collect();
    let b_data: Vec<f64> = (0..size * size).map(|i| (i as f64 + 1.0) * 0.25).collect();
    let a = Matrix::new(a_data, size, size);
    let b = Matrix::new(b_data, size, size);

    // Time naive
    let start = Instant::now();
    let c_naive = a.multiply(&b);
    let naive_time = start.elapsed();

    // Time tiled
    let start = Instant::now();
    let c_tiled = a.multiply_tiled(&b, block_size);
    let tiled_time = start.elapsed();

    // Verify correctness
    assert_eq!(c_naive.shape(), c_tiled.shape());
    for i in 0..size {
        for j in 0..size {
            let diff = (c_naive.get(i, j) - c_tiled.get(i, j)).abs();
            assert!(
                diff < 1e-10,
                "Mismatch at ({}, {}): naive={}, tiled={}",
                i, j,
                c_naive.get(i, j),
                c_tiled.get(i, j)
            );
        }
    }

    println!("Size {}×{}, block_size={}:", size, size, block_size);
    println!("  Naive: {:?}", naive_time);
    println!("  Tiled: {:?}", tiled_time);
    println!("  Speedup: {:.2}×", naive_time.as_secs_f64() / tiled_time.as_secs_f64());
}

Add this to src/main.rs:

use ch04_cache_optimized_matmul::Matrix;

fn main() {
    println!("Benchmarking naive vs tiled matrix multiplication\n");

    for &size in &[64, 128, 256] {
        Matrix::benchmark_multiply(size, 32);
        println!();
    }
}

Run it:

cargo run --release -p ch04-cache-optimized-matmul

The --release flag is essential. In a debug build, iterator adaptors such as step_by are not inlined and bounds checks are not optimized away, so loop overhead dominates and the naive algorithm can sometimes come out ahead. With --release, you should see the tiled version pull ahead, especially for larger matrices.

Example output (actual numbers depend on your CPU):

Size 64×64, block_size=32:
  Naive: 3.2ms
  Tiled: 1.1ms
  Speedup: 2.91×

Size 128×128, block_size=32:
  Naive: 21.4ms
  Tiled: 6.5ms
  Speedup: 3.29×

Size 256×256, block_size=32:
  Naive: 178.3ms
  Tiled: 47.2ms
  Speedup: 3.78×

The speedup grows with matrix size because larger matrices amplify the cache-miss penalty in the naive version.
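A single timed run can be noisy, especially for the small sizes. One common remedy is to repeat the measurement and keep the fastest run. The helper below is a sketch of that idea; time_min is a name chosen for this example, and it is not part of the benchmark function above.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `reps` times and returns the fastest wall-clock time observed.
/// Hypothetical helper; taking the minimum is one simple way to reduce timing noise.
fn time_min<F: FnMut()>(reps: usize, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..reps {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}
```

Inside benchmark_multiply, the tiled measurement could then become time_min(5, || { std::hint::black_box(a.multiply_tiled(&b, block_size)); }). Passing the result through std::hint::black_box discourages the optimizer from discarding the multiplication because its result is unused.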

Verification

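The benchmark_multiply function already verifies correctness by comparing every element of the naive and tiled results. Add these focused tests as well. If you copied the ch03 tests, place the test functions inside the existing tests module rather than declaring a second one: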

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_tiled_matches_naive_small() {
        let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3);
        let b = Matrix::new(vec![7.0, 8.0, 9.0, 10.0, 11.0, 12.0], 3, 2);
        let naive = a.multiply(&b);
        let tiled = a.multiply_tiled(&b, 2);
        assert_eq!(naive, tiled);
    }

    #[test]
    fn test_tiled_matches_naive_square() {
        let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0], 2, 2);
        let b = Matrix::new(vec![5.0, 6.0, 7.0, 8.0], 2, 2);
        let naive = a.multiply(&b);
        let tiled = a.multiply_tiled(&b, 2);
        assert_eq!(naive, tiled);
    }

    #[test]
    fn test_tiled_various_block_sizes() {
        // Generate a 6×6 matrix with sequential values
        let data: Vec<f64> = (0..36).map(|i| i as f64).collect();
        let a = Matrix::new(data.clone(), 6, 6);
        let b = Matrix::new(data, 6, 6);

        let naive = a.multiply(&b);

        for &block_size in &[1, 2, 3, 4, 6] {
            let tiled = a.multiply_tiled(&b, block_size);
            assert_eq!(
                naive, tiled,
                "Tiled multiplication failed for block_size={}",
                block_size
            );
        }
    }

    #[test]
    fn test_tiled_non_square() {
        // 4×3 times 3×5
        let a = Matrix::new((0..12).map(|i| i as f64).collect(), 4, 3);
        let b = Matrix::new((0..15).map(|i| i as f64).collect(), 3, 5);
        let naive = a.multiply(&b);
        let tiled = a.multiply_tiled(&b, 3);
        assert_eq!(naive, tiled);
    }

    #[test]
    fn test_tiled_identity() {
        let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3);
        let i = Matrix::identity(3);
        let naive = a.multiply(&i);
        let tiled = a.multiply_tiled(&i, 2);
        assert_eq!(naive, tiled);
    }
}

Run tests:

cargo test -p ch04-cache-optimized-matmul

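You should see output similar to the following (the order of the lines may differ, and any tests copied from ch03 are listed as well):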

running 5 tests
test tests::test_tiled_matches_naive_small ... ok
test tests::test_tiled_matches_naive_square ... ok
test tests::test_tiled_non_square ... ok
test tests::test_tiled_identity ... ok
test tests::test_tiled_various_block_sizes ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Walkthrough

Tiling — The outer three loops (i, j, k) iterate over tile positions, stepping by block_size. Each (i, j, k) combination selects one tile pair, and multiplying that pair is a small matrix multiplication that fits in cache.
Inner loop order i, k, j — For each element $A_{ik}$, we iterate over all $j$ columns in the output tile. This means $\mathbf{B}$ is accessed sequentially (row-major within the tile), and $\mathbf{C}$ accumulates in a fixed row. Compare with the naive order i, j, k: there, for each $C_{ij}$, we iterate over $k$, which strides down a column of $\mathbf{B}$.
step_by — Rust's Iterator::step_by method gives us clean tile-boundary iteration without manual offset arithmetic.
min for edge tiles — When the matrix dimension isn't a multiple of block_size, the last tile in each dimension is smaller. (i + block_size).min(m) handles this without special cases.
In-place accumulation — We initialize $\mathbf{C}$ with zeros, then accumulate results directly. A tile of $\mathbf{C}$ stays in cache while we sum over the $k$ dimension. This is the key to the performance gain.
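No get calls — The tiled version indexes self.data and other.data directly rather than going through the get() method. This keeps the flat-index arithmetic visible to the optimizer in the innermost loop and avoids per-call overhead, a common optimization for hot paths. Note that indexing a Vec is still bounds-checked in Rust unless the optimizer can prove the index is in range, so direct indexing does not by itself remove bounds checks.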
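The step_by and min pattern from the walkthrough can be tried in isolation. The sketch below returns the half-open tile ranges for one dimension; tile_ranges is a helper invented for this example.

```rust
/// Half-open (start, end) ranges that cover 0..len in steps of `block`.
/// The last range is shorter when `block` does not divide `len`.
fn tile_ranges(len: usize, block: usize) -> Vec<(usize, usize)> {
    assert!(block > 0, "block must be non-zero (step_by panics on a zero step)");
    (0..len)
        .step_by(block)
        .map(|start| (start, (start + block).min(len)))
        .collect()
}
```

For a dimension of 6 with a block size of 4 this yields [(0, 4), (4, 6)]: one full tile followed by a shorter edge tile, with no special-case code.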

Key Takeaways

Naive matrix multiplication is memory-bound — the CPU stalls waiting for data from RAM.
Tiling reorganizes the computation to reuse data in the CPU cache, multiplying small blocks at a time.
The block size $b$ is chosen so that two tiles (one from $\mathbf{A}$, one from $\mathbf{B}$) fit in cache simultaneously (the output tile stays in cache): $2 \times b^2 \times 8 \text{ bytes} \leq \text{L1 cache size}$.
The inner loop order i, k, j ensures a sequential memory access pattern within each tile.
Tiled multiplication is mathematically identical to naive multiplication: same formula, with the multiply-add operations regrouped by tile.
Always benchmark with --release for meaningful comparisons.