Cache-Optimized Matrix Multiplication
Concept
Tiled (blocked) matrix multiplication reorganizes the computation to process small sub-blocks that fit in the CPU cache, reducing memory traffic and improving performance.
Why This Matters
Naive matrix multiplication is memory-bound — the CPU spends most of its time waiting for data from RAM, not doing arithmetic. On modern hardware, reading from RAM is ~100× slower than reading from the L1 cache. Tiling keeps data in cache longer, turning a memory-bound operation into a compute-bound one. Every high-performance matrix library (BLAS, OpenBLAS, Intel MKL) uses some form of tiling. Understanding tiling is the first step toward understanding how real numerical software extracts maximum performance from hardware.
Background: Why Naive Multiplication Is Slow
The naive algorithm from ch03:
for i in 0..m {
for j in 0..p {
let mut sum = 0.0;
for k in 0..n {
sum += self.get(i, k) * other.get(k, j);
}
data.push(sum);
}
}
Look at the inner loop: for each $C_{ij}$, we stride across row $i$ of $\mathbf{A}$ (sequential in memory — good) and column $j$ of $\mathbf{B}$ (stride of other.cols — bad).
Matrix $\mathbf{B}$ is stored in row-major order. Accessing $\mathbf{B}_{kj}$ means jumping by other.cols elements each time. For a large matrix, each access misses the cache and must fetch from RAM. Over the entire multiplication, every element of $\mathbf{B}$ is loaded from RAM $m$ times — once for each row of $\mathbf{A}$.
The same is true for $\mathbf{A}$ if we swap the loop order. The core problem: the naive algorithm cannot reuse data — each element of $\mathbf{A}$ and $\mathbf{B}$ is loaded from memory many times.
Intuition
The cache hierarchy
A modern CPU has multiple levels of cache:
| Level | Typical size | Latency (cycles) | Relative to L1 |
|---|---|---|---|
| L1 | 32 KB | ~4 | 1× |
| L2 | 256 KB | ~12 | 3× |
| L3 | 8–32 MB | ~40 | 10× |
| RAM | 8+ GB | ~200 | 50× |
If data fits in L1, the CPU can access it in ~4 cycles. If it has to go to RAM, that's ~200 cycles — 50× slower.
Tiling
Instead of multiplying the entire matrix at once, we break it into tiles (blocks) small enough that a tile from $\mathbf{A}$ and a tile from $\mathbf{B}$ fit in the cache together. Then we process one tile at a time:
A (m×n) B (n×p) C (m×p)
┌───────┐ ┌───────┐ ┌───────┐
│ ██ ██ │ │ ██ ██ │ │ ██ ██ │
│ ██ ██ │ │ ██ ██ │ │ ██ ██ │
│ ██ ██ │ × │ ██ ██ │ = │ ██ ██ │
│ ██ ██ │ │ ██ ██ │ │ ██ ██ │
│ ██ ██ │ │ ██ ██ │ │ ██ ██ │
└───────┘ └───────┘ └───────┘
Tiled view (block_size = 2):
A tile of A (rows i..i+BS, k..k+BS) B tile (k..k+BS, j..j+BS) C tile (i..i+BS, j..j+BS)
┌───────┐ ┌───────┐ ┌───────┐
│a11 a12│ │b11 b12│ │c11 c12│
│a21 a22│ │b21 b22│ │c21 c22│
└───────┘ └───────┘ └───────┘
Each result tile $C_{ij}$ is the sum of products of corresponding tiles from $\mathbf{A}$ and $\mathbf{B}$:
This is the same formula as naive multiplication — we've just changed the order of summation to group operations by tile. Mathematically it's identical; computationally it's much faster.
Within a tile, data is accessed sequentially (row-major), and the tile stays in cache across all three inner loops.
Mathematical Notation
Tiled multiplication computes the same result but groups the summation into blocks:
where:
- $\mathbf{C}_{ij}$ is a tile of $\mathbf{C}$ of size $b \times b$ (starting at row $i \cdot b$, column $j \cdot b$)
- $\mathbf{A}_{i,t}$ is a tile of $\mathbf{A}$ of size $b \times b$
- $\mathbf{B}_{t,j}$ is a tile of $\mathbf{B}$ of size $b \times b$
- $T$ is the number of tiles along the shared dimension
- $b$ is the block size
The block size $b$ is chosen so that two tiles (one from $\mathbf{A}$, one from $\mathbf{B}$) plus the output tile fit in the L1 or L2 cache:
For a 32 KB L1 cache (8 bytes per f64):
Common block sizes are powers of two: 16, 32, 64. We use 32 as a portable default.
Rust Implementation
Add the following methods to the Matrix struct from ch03 (which you created at ch03/src/lib.rs). Alternatively, create a new crate and copy the Matrix code from ch03:
# Option: extend your existing ch03 crate
cd ch03 # or create a new crate: cargo new --name ch04 --lib ch04 && cd ch04
Add to src/lib.rs within the impl Matrix block:
/// Tiled (blocked) matrix multiplication.
///
/// Processes the matrix in blocks of size `block_size` that fit in the CPU cache,
/// reducing memory traffic compared to the naive algorithm.
///
/// C_{ij} = Σ_{t} A_{i,t} B_{t,j} (tiled formulation)
///
/// # Panics
/// Panics if self.cols != other.rows (incompatible dimensions).
pub fn multiply_tiled(&self, other: &Matrix, block_size: usize) -> Matrix {
assert_eq!(
self.cols, other.rows,
"Incompatible dimensions: {}×{} vs {}×{}",
self.rows, self.cols, other.rows, other.cols
);
let m = self.rows;
let n = self.cols;
let p = other.cols;
// Initialize result matrix with zeros
let mut c = Matrix::new(vec![0.0; m * p], m, p);
for i in (0..m).step_by(block_size) {
let i_end = (i + block_size).min(m);
for j in (0..p).step_by(block_size) {
let j_end = (j + block_size).min(p);
for k in (0..n).step_by(block_size) {
let k_end = (k + block_size).min(n);
// Multiply the current tile: A[i..i_end][k..k_end] × B[k..k_end][j..j_end]
for ii in i..i_end {
for kk in k..k_end {
let a_ik = self.data[ii * n + kk];
for jj in j..j_end {
c.data[ii * p + jj] += a_ik * other.data[kk * p + jj];
}
}
}
}
}
}
c
}
The inner loop order matters: i, k, j (not i, j, k). Here's why:
- The
kkloop loads one element from $\mathbf{A}$ (a_ik), then iterates over columns of $\mathbf{B}$ (jj). Within thejjloop, $\mathbf{B}$ is accessed sequentially (row-major) — cache-friendly. - The result $\mathbf{C}$ is accumulated in-place, so the output tile stays in cache.
- By processing a tile of $\mathbf{A}$ with a tile of $\mathbf{B}$, both tiles fit in cache together.
Benchmarking
Add a benchmark function to compare naive vs tiled:
/// Compare naive and tiled multiplication for correctness and performance.
pub fn benchmark_multiply(size: usize, block_size: usize) {
use std::time::Instant;
// Create two random-ish matrices
let a_data: Vec<f64> = (0..size * size).map(|i| (i as f64 + 1.0) * 0.5).collect();
let b_data: Vec<f64> = (0..size * size).map(|i| (i as f64 + 1.0) * 0.25).collect();
let a = Matrix::new(a_data, size, size);
let b = Matrix::new(b_data, size, size);
// Time naive
let start = Instant::now();
let c_naive = a.multiply(&b);
let naive_time = start.elapsed();
// Time tiled
let start = Instant::now();
let c_tiled = a.multiply_tiled(&b, block_size);
let tiled_time = start.elapsed();
// Verify correctness
assert_eq!(c_naive.shape(), c_tiled.shape());
for i in 0..size {
for j in 0..size {
let diff = (c_naive.get(i, j) - c_tiled.get(i, j)).abs();
assert!(
diff < 1e-10,
"Mismatch at ({}, {}): naive={}, tiled={}",
i, j,
c_naive.get(i, j),
c_tiled.get(i, j)
);
}
}
println!("Size {}×{}, block_size={}:", size, size, block_size);
println!(" Naive: {:?}", naive_time);
println!(" Tiled: {:?}", tiled_time);
println!(" Speedup: {:.2}×", naive_time.as_secs_f64() / tiled_time.as_secs_f64());
}
Run it from main:
fn main() {
Matrix::benchmark_multiply(128, 32);
Matrix::benchmark_multiply(256, 32);
Matrix::benchmark_multiply(512, 32);
}
Add this to src/main.rs:
use ch03::Matrix;
fn main() {
println!("Benchmarking naive vs tiled matrix multiplication\n");
for &size in &[64, 128, 256] {
Matrix::benchmark_multiply(size, 32);
println!();
}
}
Run it:
cargo run --release
The --release flag is essential — without optimizations, the naive algorithm can sometimes be faster because the compiler doesn't optimize the tiled version's loop structure as well. With --release, you should see the tiled version pull ahead, especially for larger matrices.
Example output (actual numbers depend on your CPU):
Size 64×64, block_size=32:
Naive: 3.2ms
Tiled: 1.1ms
Speedup: 2.91×
Size 128×128, block_size=32:
Naive: 21.4ms
Tiled: 6.5ms
Speedup: 3.29×
Size 256×256, block_size=32:
Naive: 178.3ms
Tiled: 47.2ms
Speedup: 3.78×
The speedup grows with matrix size because larger matrices amplify the cache-miss penalty in the naive version.
Verification
The benchmark_multiply method already verifies correctness by comparing every element of the naive and tiled results. Add these focused tests as well:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_tiled_matches_naive_small() {
let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3);
let b = Matrix::new(vec![7.0, 8.0, 9.0, 10.0, 11.0, 12.0], 3, 2);
let naive = a.multiply(&b);
let tiled = a.multiply_tiled(&b, 2);
assert_eq!(naive, tiled);
}
#[test]
fn test_tiled_matches_naive_square() {
let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0], 2, 2);
let b = Matrix::new(vec![5.0, 6.0, 7.0, 8.0], 2, 2);
let naive = a.multiply(&b);
let tiled = a.multiply_tiled(&b, 2);
assert_eq!(naive, tiled);
}
#[test]
fn test_tiled_various_block_sizes() {
// Generate a 6×6 matrix with sequential values
let data: Vec<f64> = (0..36).map(|i| i as f64).collect();
let a = Matrix::new(data.clone(), 6, 6);
let b = Matrix::new(data, 6, 6);
let naive = a.multiply(&b);
for &block_size in &[1, 2, 3, 4, 6] {
let tiled = a.multiply_tiled(&b, block_size);
assert_eq!(
naive, tiled,
"Tiled multiplication failed for block_size={}",
block_size
);
}
}
#[test]
fn test_tiled_non_square() {
// 4×3 times 3×5
let a = Matrix::new((0..12).map(|i| i as f64).collect(), 4, 3);
let b = Matrix::new((0..15).map(|i| i as f64).collect(), 3, 5);
let naive = a.multiply(&b);
let tiled = a.multiply_tiled(&b, 3);
assert_eq!(naive, tiled);
}
#[test]
fn test_tiled_identity() {
let a = Matrix::new(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3);
let i = Matrix::identity(3);
let naive = a.multiply(&i);
let tiled = a.multiply_tiled(&i, 2);
assert_eq!(naive, tiled);
}
}
Run tests:
cargo test
You should see:
running 5 tests
test tests::test_tiled_matches_naive_small ... ok
test tests::test_tiled_matches_naive_square ... ok
test tests::test_tiled_non_square ... ok
test tests::test_tiled_identity ... ok
test tests::test_tiled_various_block_sizes ... ok
test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 0 filtered
Walkthrough
-
Tiling — The outer three loops (
i,j,k) iterate over tile positions, stepping byblock_size. Each triple-nested-tile is a small matrix multiplication that fits in cache. -
Inner loop order
i, k, j— For each element $A_{ik}$, we iterate over all $j$ columns in the output tile. This means $\mathbf{B}$ is accessed sequentially (row-major within the tile), and $\mathbf{C}$ accumulates in a fixed row. Compare with the naive orderi, j, k: there, for each $C_{ij}$, we iterate over $k$, which strided across $\mathbf{B}$'s columns. -
step_by— Rust'sIterator::step_bymethod gives us clean tile-boundary iteration without manual offset arithmetic. -
minfor edge tiles — When the matrix dimension isn't a multiple ofblock_size, the last tile in each dimension is smaller.(i + block_size).min(m)handles this without special cases. -
In-place accumulation — We initialize $\mathbf{C}$ with zeros, then accumulate results directly. A tile of $\mathbf{C}$ stays in cache while we sum over the $k$ dimension. This is the key to the performance gain.
-
No
getcalls — The tiled version accessesself.dataandother.datadirectly rather than through theget()method. This avoids bounds-check overhead in the innermost loop — a common optimization for hot paths.
Key Takeaways
- Naive matrix multiplication is memory-bound — the CPU stalls waiting for data from RAM.
- Tiling reorganizes the computation to reuse data in the CPU cache, multiplying small blocks at a time.
- The block size $b$ is chosen so that two input tiles plus one output tile fit in cache: $3 \times b^2 \times 8 \text{ bytes} \leq \text{L1 cache size}$.
- The inner loop order
i, k, jensures sequential memory access pattern within each tile. - Tiled multiplication is mathematically identical to naive multiplication — same formula, different order of summation.
- Always benchmark with
--releasefor meaningful comparisons.