Cache-Optimized Matrix Multiplication

Concept

Tiled (blocked) matrix multiplication reorganizes the computation to process small sub-blocks that fit in the CPU cache, reducing memory traffic and improving performance.

Why This Matters

Naive matrix multiplication is memory-bound — the CPU spends most of its time waiting for data from RAM, not doing arithmetic. On modern hardware, reading from RAM is ~100× slower than reading from the L1 cache. Tiling keeps data in cache longer, turning a memory-bound operation into a compute-bound one. Every high-performance matrix library (BLAS, OpenBLAS, Intel MKL) uses some form of tiling. Understanding tiling is the first step toward understanding how real numerical software extracts maximum performance from hardware.


Background: Why Naive Multiplication Is Slow

The naive algorithm from ch03:

for i in 0..m {
    for j in 0..p {
        let mut sum = 0.0;
        for k in 0..n {
            sum += self.get(i, k) * other.get(k, j);
        }
        data.push(sum);
    }
}

Look at the inner loop: for each $C_{ij}$, we stride across row $i$ of $\mathbf{A}$ (sequential in memory — good) and column $j$ of $\mathbf{B}$ (stride of other.colsbad).

Matrix $\mathbf{B}$ is stored in row-major order. Accessing $\mathbf{B}_{kj}$ means jumping by other.cols elements each time. For a large matrix, each access misses the cache and must fetch from RAM. Over the entire multiplication, every element of $\mathbf{B}$ is loaded from RAM $m$ times — once for each row of $\mathbf{A}$.

The same is true for $\mathbf{A}$ if we swap the loop order. The core problem: the naive algorithm cannot reuse data — each element of $\mathbf{A}$ and $\mathbf{B}$ is loaded from memory many times.


Intuition

The cache hierarchy

A modern CPU has multiple levels of cache:

LevelTypical sizeLatency (cycles)Relative to L1
L132 KB~4
L2256 KB~12
L38–32 MB~4010×
RAM8+ GB~20050×

If data fits in L1, the CPU can access it in ~4 cycles. If it has to go to RAM, that's ~200 cycles — 50× slower.

Tiling

Instead of multiplying the entire matrix at once, we break it into tiles (blocks) small enough that a tile from $\mathbf{A}$ and a tile from $\mathbf{B}$ fit in the cache together. Then we process one tile at a time:

A (m×n)          B (n×p)          C (m×p)
┌───────┐        ┌───────┐        ┌───────┐
│ ██ ██ │        │ ██ ██ │        │ ██ ██ │
│ ██ ██ │        │ ██ ██ │        │ ██ ██ │
│ ██ ██ │   ×    │ ██ ██ │   =    │ ██ ██ │
│ ██ ██ │        │ ██ ██ │        │ ██ ██ │
│ ██ ██ │        │ ██ ██ │        │ ██ ██ │
└───────┘        └───────┘        └───────┘

Tiled view (block_size = 2):
  A tile of A (rows i..i+BS, k..k+BS)   B tile (k..k+BS, j..j+BS)   C tile (i..i+BS, j..j+BS)
    ┌───────┐                             ┌───────┐                    ┌───────┐
    │a11 a12│                             │b11 b12│                    │c11 c12│
    │a21 a22│                             │b21 b22│                    │c21 c22│
    └───────┘                             └───────┘                    └───────┘

Each result tile $C_{ij}$ is the sum of products of corresponding tiles from $\mathbf{A}$ and $\mathbf{B}$:

$$ C_{ij} = \sum_{k} A_{ik} B_{kj} $$

This is the same formula as naive multiplication — we've just changed the order of summation to group operations by tile. Mathematically it's identical; computationally it's much faster.

Within a tile, data is accessed sequentially (row-major), and the tile stays in cache across all three inner loops.


Mathematical Notation

Tiled multiplication computes the same result but groups the summation into blocks:

$$ \mathbf{C}_{ij} = \sum_{t=0}^{T-1} \mathbf{A}_{i,t} \mathbf{B}_{t,j} $$

where:

The block size $b$ is chosen so that two tiles (one from $\mathbf{A}$, one from $\mathbf{B}$) plus the output tile fit in the L1 or L2 cache:

$$ 2 \times b^2 \times 8 \text{ bytes} \leq \text{cache size} $$

For a 32 KB L1 cache (8 bytes per f64):

$$ b \leq \sqrt{\frac{32768}{2 \times 8}} \approx 45 $$

Common block sizes are powers of two: 16, 32, 64. We use 32 as a portable default.


Rust Implementation

Add the following methods to the Matrix struct from ch03 (which you created at ch03/src/lib.rs). Alternatively, create a new crate and copy the Matrix code from ch03:

# Option: extend your existing ch03 crate
cd ch03   # or create a new crate: cargo new --name ch04 --lib ch04 && cd ch04

Add to src/lib.rs within the impl Matrix block:

/// Tiled (blocked) matrix multiplication.
///
/// Processes the matrix in blocks of size `block_size` that fit in the CPU cache,
/// reducing memory traffic compared to the naive algorithm.
///
/// C_{ij} = Σ_{t} A_{i,t} B_{t,j}  (tiled formulation)
///
/// # Panics
/// Panics if self.cols != other.rows (incompatible dimensions).
pub fn multiply_tiled(&self, other: &Matrix, block_size: usize) -> Matrix {
    assert_eq!(
        self.cols, other.rows,
        "Incompatible dimensions: {}×{} vs {}×{}",
        self.rows, self.cols, other.rows, other.cols
    );

    let m = self.rows;
    let n = self.cols;
    let p = other.cols;

    // Initialize result matrix with zeros
    let mut c = Matrix::new(vec![0.0; m * p], m, p);

    for i in (0..m).step_by(block_size) {
        let i_end = (i + block_size).min(m);
        for j in (0..p).step_by(block_size) {
            let j_end = (j + block_size).min(p);
            for k in (0..n).step_by(block_size) {
                let k_end = (k + block_size).min(n);

                // Multiply the current tile: A[i..i_end][k..k_end] × B[k..k_end][j..j_end]
                for ii in i..i_end {
                    for kk in k..k_end {
                        let a_ik = self.data[ii * n + kk];
                        for jj in j..j_end {
                            c.data[ii * p + jj] += a_ik * other.data[kk * p + jj];
                        }
                    }
                }
            }
        }
    }

    c
}

The inner loop order matters: i, k, j (not i, j, k). Here's why:

Benchmarking

Add a benchmark function to compare naive vs tiled:

/// Compare naive and tiled multiplication for correctness and performance.
pub fn benchmark_multiply(size: usize, block_size: usize) {
    use std::time::Instant;

    // Create two random-ish matrices
    let a_data: Vec<f64> = (0..size * size).map(|i| (i as f64 + 1.0) * 0.5).collect();
    let b_data: Vec<f64> = (0..size * size).map(|i| (i as f64 + 1.0) * 0.25).collect();
    let a = Matrix::new(a_data, size, size);
    let b = Matrix::new(b_data, size, size);

    // Time naive
    let start = Instant::now();
    let c_naive = a.multiply(&b);
    let naive_time = start.elapsed();

    // Time tiled
    let start = Instant::now();
    let c_tiled = a.multiply_tiled(&b, block_size);
    let tiled_time = start.elapsed();

    // Verify correctness
    assert_eq!(c_naive.shape(), c_tiled.shape());
    for i in 0..size {
        for j in 0..size {
            let diff = (c_naive.get(i, j) - c_tiled.get(i, j)).abs();
            assert!(
                diff < 1e-10,
                "Mismatch at ({}, {}): naive={}, tiled={}",
                i, j,
                c_naive.get(i, j),
                c_tiled.get(i, j)
            );
        }
    }

    println!("Size {}×{}, block_size={}:", size, size, block_size);
    println!("  Naive: {:?}", naive_time);
    println!("  Tiled: {:?}", tiled_time);
    println!("  Speedup: {:.2}×", naive_time.as_secs_f64() / tiled_time.as_secs_f64());
}

Run it from main:

fn main() {
    Matrix::benchmark_multiply(128, 32);
    Matrix::benchmark_multiply(256, 32);
    Matrix::benchmark_multiply(512, 32);
}

Add this to src/main.rs:

use ch03::Matrix;

fn main() {
    println!("Benchmarking naive vs tiled matrix multiplication\n");

    for &size in &[64, 128, 256] {
        Matrix::benchmark_multiply(size, 32);
        println!();
    }
}

Run it:

cargo run --release

The --release flag is essential — without optimizations, the naive algorithm can sometimes be faster because the compiler doesn't optimize the tiled version's loop structure as well. With --release, you should see the tiled version pull ahead, especially for larger matrices.

Example output (actual numbers depend on your CPU):

Size 64×64, block_size=32:
  Naive: 3.2ms
  Tiled: 1.1ms
  Speedup: 2.91×

Size 128×128, block_size=32:
  Naive: 21.4ms
  Tiled: 6.5ms
  Speedup: 3.29×

Size 256×256, block_size=32:
  Naive: 178.3ms
  Tiled: 47.2ms
  Speedup: 3.78×

The speedup grows with matrix size because larger matrices amplify the cache-miss penalty in the naive version.


Verification

The benchmark_multiply method already verifies correctness by comparing every element of the naive and tiled results. Add these focused tests as well:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_tiled_matches_naive_small() {
        let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3);
        let b = Matrix::new(vec![7.0, 8.0, 9.0, 10.0, 11.0, 12.0], 3, 2);
        let naive = a.multiply(&b);
        let tiled = a.multiply_tiled(&b, 2);
        assert_eq!(naive, tiled);
    }

    #[test]
    fn test_tiled_matches_naive_square() {
        let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0], 2, 2);
        let b = Matrix::new(vec![5.0, 6.0, 7.0, 8.0], 2, 2);
        let naive = a.multiply(&b);
        let tiled = a.multiply_tiled(&b, 2);
        assert_eq!(naive, tiled);
    }

    #[test]
    fn test_tiled_various_block_sizes() {
        // Generate a 6×6 matrix with sequential values
        let data: Vec<f64> = (0..36).map(|i| i as f64).collect();
        let a = Matrix::new(data.clone(), 6, 6);
        let b = Matrix::new(data, 6, 6);

        let naive = a.multiply(&b);

        for &block_size in &[1, 2, 3, 4, 6] {
            let tiled = a.multiply_tiled(&b, block_size);
            assert_eq!(
                naive, tiled,
                "Tiled multiplication failed for block_size={}",
                block_size
            );
        }
    }

    #[test]
    fn test_tiled_non_square() {
        // 4×3 times 3×5
        let a = Matrix::new((0..12).map(|i| i as f64).collect(), 4, 3);
        let b = Matrix::new((0..15).map(|i| i as f64).collect(), 3, 5);
        let naive = a.multiply(&b);
        let tiled = a.multiply_tiled(&b, 3);
        assert_eq!(naive, tiled);
    }

    #[test]
    fn test_tiled_identity() {
        let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3);
        let i = Matrix::identity(3);
        let naive = a.multiply(&i);
        let tiled = a.multiply_tiled(&i, 2);
        assert_eq!(naive, tiled);
    }
}

Run tests:

cargo test

You should see:

running 5 tests
test tests::test_tiled_matches_naive_small ... ok
test tests::test_tiled_matches_naive_square ... ok
test tests::test_tiled_non_square ... ok
test tests::test_tiled_identity ... ok
test tests::test_tiled_various_block_sizes ... ok

test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered

Walkthrough


Key Takeaways