Chapter 5. CUDA Programming as Hardware-Software Co-Optimization

From Naive Matrix Multiplication to Hierarchical Tiling

This chapter reconstructs the lecture as a coherent book-style narrative by aligning the spoken transcript with the slide flow. The lecture begins with the central thesis that CUDA programming is fundamentally a **mapping problem**: high-dimensional tensor workloads must be mapped onto linear and hierarchical hardware resources. The slides then develop that thesis through one running example, matrix multiplication, and progressively introduce the key optimization ideas: warp-aware thread mapping, global-memory coalescing, shared-memory tiling, register tiling, memory vectorization, warp tiling, and shared-memory bank-conflict management. The overall logic of the chapter therefore follows the same sequence as the lecture itself, but expands the explanation, fills in technical gaps, and makes explicit the systems reasoning that is sometimes only implicit in classroom delivery.

In the previous lecture, we studied GPU microarchitecture. We examined the computational resources of the machine, including CUDA cores and tensor cores, and we examined the storage and communication hierarchy, from HBM to on-chip caches, shared memory, and registers. That earlier discussion matters because CUDA optimization only becomes intellectually clear when one stops treating it as a syntax problem and starts treating it as a machine-mapping problem. The kernel launch syntax, the `__global__` keyword, the block dimensions, and the thread indices are only surface manifestations. The deeper issue is always the same: how should a high-dimensional workload be laid out and scheduled so that the actual GPU hardware sees data and computation in the form it can execute efficiently?

That is why the lecture insists on one take-home message from the beginning: **CUDA programming is hardware-software co-optimization.** The software is not merely an algorithm written down in CUDA. It is a particular mapping of an algorithm onto the GPU’s execution and memory hierarchy. If that mapping aligns with the hardware, performance can be excellent. If the mapping is awkward, even a theoretically beautiful workload can run very badly.

The example that makes all of this vivid is matrix multiplication. Matrix multiplication is one of the most important kernels in machine learning, and it is also one of the best possible pedagogical examples because it has two seemingly contradictory properties. On the one hand, it is mathematically simple and has excellent theoretical arithmetic intensity. On the other hand, once one begins implementing it on a GPU, it immediately exposes almost every important systems issue: data-layout mismatch, execution mismatch, memory coalescing, data reuse, storage placement, bank conflicts, and the tradeoff between locality and parallelism. Matrix multiplication is therefore not just a kernel. It is a microscope through which CUDA programming can be understood.


1. CUDA programming is a mapping problem

The lecture begins by stating the problem in an intentionally abstract way. The workload in machine learning is usually a tensor computation. A matrix is a two-dimensional tensor. A batch of matrices is a higher-dimensional tensor. In convolutional and transformer workloads, tensors often carry even richer logical structure. The neighborhood relations of these tensors are naturally multi-dimensional. In a matrix, an element has row-wise and column-wise neighbors. In an image tensor, a pixel has spatial neighbors. In a batched tensor, locality may exist simultaneously across batch, channel, height, and width.

The hardware, however, does not expose that structure directly. Memory is addressed linearly. At the lowest level, DRAM is not “two-dimensional mathematics”; it is an addressable storage medium organized around rows, columns, pages, and bursts. Shared memory is small, explicitly managed SRAM. Registers are private to threads. The GPU scheduler does not execute a two-dimensional grid as an abstract geometric object. It executes one-dimensional warps of 32 lanes in lockstep. This is the first deep conceptual tension in CUDA: the workload is multi-dimensional, but the underlying mechanisms that move and execute it are much closer to one-dimensional, linear, and hierarchical structures.

The lecture then identifies three specific mismatches that arise from this fact. The first is **data-layout mismatch**. Multi-dimensional tensors must be placed into linear memory. As soon as we choose a layout such as row-major storage, we preserve locality along one dimension and weaken locality along another. The second is **execution mismatch**. We may write code using two-dimensional or three-dimensional thread coordinates, but the hardware executes one-dimensional warps. That means the mapping from `threadIdx.{x,y,z}` to the problem domain is not superficial; it determines which threads become neighbors inside a warp and therefore which memory addresses are accessed together. The third is **communication placement**. The same arithmetic can have drastically different cost depending on where data movement occurs. Reusing a value already in a register is extremely cheap. Reusing a value in shared memory is also good, though not as cheap. Fetching repeatedly from HBM is far more expensive. The arithmetic is unchanged, but the communication substrate completely changes the observed performance. The opening slide of the deck states exactly this framing: CUDA programming is about mapping high-dimensional workloads onto linear and hierarchical hardware, and performance is shaped by data layout, warp execution, and the placement of communication across HBM, shared memory, and registers.
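
A two-line example makes the data-layout mismatch concrete. In row-major storage, neighbors along the column index sit at adjacent addresses, while neighbors along the row index are a full row apart (the helper below is an illustrative sketch, not code from the lecture):

// Row-major layout of an N x N matrix: element (i, j) lives at linear offset i * N + j.
// Stepping along j moves to the adjacent address; stepping along i jumps N elements at once.
__host__ __device__ inline float elem(const float* A, int N, int i, int j) {
    return A[i * N + j];
}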

This abstract framing is not merely philosophical. It predicts every optimization that comes later. Once one understands that CUDA optimization is really the progressive removal of data-layout mismatch, execution mismatch, and poor communication placement, the whole lecture becomes conceptually unified.


2. The running example: a naive matrix-multiplication kernel

To make the discussion concrete, the lecture specializes to a square matrix multiplication:

$$ C = A \times B + C, $$

where $A$, $B$, and $C$ are all $N \times N$ matrices stored in row-major order. This is a standard GEMM-like operation. The natural thread-mapping strategy is obvious: assign one thread to one element $C_{ij}$. Then that thread reads row $i$ of $A$, column $j$ of $B$, performs the dot product, adds the previous value of $C_{ij}$, and writes the final result.

A minimal CUDA kernel for that idea, written in the slightly more general GEMM form $C \leftarrow \alpha A B + \beta C$, is straightforward:

__global__ void sgemm_naive(
    int M, int N, int K,
    float alpha,
    const float* A,
    const float* B,
    float beta,
    float* C) {

    // One thread per output element: x indexes the row of C, y indexes the column.
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < M && y < N) {
        float tmp = 0.0f;
        // Textbook dot product of row x of A with column y of B.
        for (int k = 0; k < K; ++k) {
            tmp += A[x * K + k] * B[k * N + y];
        }
        // GEMM update: C = alpha * (A x B) + beta * C.
        C[x * N + y] = alpha * tmp + beta * C[x * N + y];
    }
}

This kernel looks perfectly reasonable. It is mathematically faithful. It is easy to understand. It uses a two-dimensional grid and two-dimensional blocks because the output matrix is itself two-dimensional. Every thread gets a clear responsibility. The inner loop is just the textbook dot product. There is no functional error, and this is exactly why the example is so instructive: the code is correct, the workload is theoretically excellent, and the performance is still terrible. The problem is not correctness. The problem is mapping.

This is the first moment where CUDA begins to differ from ordinary algorithm design. In a standard algorithms course, a functionally correct matrix multiplication of this form is already the main achievement. In high-performance GPU programming, functionally correct code is just the beginning. The real question is whether the mapping induces a dataflow that respects the machine.


3. On paper, matrix multiplication should be compute-bound

The lecture then asks an apparently simple but extremely revealing question: is matrix multiplication compute-bound or memory-bound?

At the mathematical level, matrix multiplication seems like a dream workload for the GPU. The arithmetic work grows as $O(N^3)$, while the amount of stored data grows as $O(N^2)$. As the matrix size grows, the arithmetic intensity improves. The slides make this concrete by using an H100 SXM GPU and a $4096 \times 4096$ matrix multiplication. The H100 offers about 80 GB of HBM3 with roughly $4.096\ \text{TB/s}$ of peak memory bandwidth, and about $60\ \text{TFLOP/s}$ of FP32 CUDA-core throughput, with even higher tensor-core throughput in lower-precision formats.

For $C = A \times B + C$ with $A, B, C \in \mathbb{R}^{4096 \times 4096}$, the total floating-point work is approximately

$$ 2 \cdot 4096^3 + 4096^2 \approx 137\ \text{GFLOPs}, $$

where the factor of two comes from one multiply and one add per dot-product step, and the final $4096^2$ term accounts for the addition of the old $C$. The slides explicitly state this number.

The minimum amount of data that must be read is the three input matrices $A$, $B$, and the original $C$, which together cost about

$$ 3 \cdot 4096^2 \cdot 4\ \text{B} \approx 201\ \text{MB}, $$

and writing the output $C$ adds another

$$ 4096^2 \cdot 4\ \text{B} \approx 67\ \text{MB}. $$

So the total lower-bound traffic is about $268\ \text{MB}$. Again, the slide deck provides exactly these numbers.

From there, the idealized lower bounds are easy to compute. If one divides 268 MB by 4.096 TB/s, the transfer lower bound is around $0.07\ \text{ms}$. If one divides 137 GFLOPs by 60 TFLOP/s, the compute lower bound is around $2.3\ \text{ms}$. So the paper calculation says that compute time is roughly thirty times larger than the pure memory-transfer lower bound. The lecture therefore concludes, correctly at the algorithmic level, that GEMM should be compute-bound rather than bandwidth-bound. The slide on this point says exactly that: the operation is compute-bound, not bandwidth-bound.

This is a perfect place to connect the lecture to the roofline model. The roofline says that performance is bounded by

$$ P \le \min(P_{\text{peak}}, BW_{\text{peak}}\cdot I), $$

where $P_{\text{peak}}$ is the peak arithmetic throughput, $BW_{\text{peak}}$ is the peak bandwidth, and $I$ is arithmetic intensity in FLOPs per byte. GEMM has high intensity in theory, so it should sit near the compute roof rather than the memory roof.
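
Plugging the slide's numbers into this model makes the prediction explicit. The ideal arithmetic intensity of the $4096 \times 4096$ GEMM is

$$ I \approx \frac{137\ \text{GFLOPs}}{268\ \text{MB}} \approx 511\ \text{FLOPs/B}, $$

while the ridge point implied by the same figures is only about $60\ \text{TFLOP/s} \,/\, 4.096\ \text{TB/s} \approx 15\ \text{FLOPs/B}$, so an ideal implementation sits far to the right of the ridge, pinned against the compute roof.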

And yet that is not what the naive kernel does.


4. Welcome to the real world: the two-order-of-magnitude gap

The lecture then transitions from theory to measurement. The slide literally says, “Welcome to the real world!” It reports that although the H100 FP32 CUDA cores offer about $60\ \text{TFLOP/s}$, the naive kernel reaches only about $500\ \text{GFLOP/s}$, with total runtime exceeding 200 ms for the $4096 \times 4096$ case.

This is not a small inefficiency. It is a performance collapse. The measured throughput is more than a hundred times below the FP32 peak. The lecture quite rightly treats this as a debugging puzzle. The algorithm is correct. The arithmetic workload is large and highly parallel. The paper calculation predicts a compute-bound workload. So why does the implementation behave so badly?

The answer is subtle but foundational: the paper calculation only counts **logical** data movement, not the **physical** data movement induced by the implementation. The lower-bound traffic estimate assumes that every byte delivered by memory is useful, and that data are reused when they should be reused. The naive kernel violates both assumptions. It generates poor access patterns that waste bandwidth, and it repeatedly reloads values that should have been shared. Once that happens, the realized arithmetic intensity collapses, and a theoretically compute-bound workload can behave like an inefficient memory kernel.

This is why the lecture does not immediately jump to “use shared memory” as a magic trick. It first asks a more basic question: do we really understand memory?


5. DRAM, HBM, row buffers, and why coalescing exists

The transcript spends substantial time unpacking the physical reality of DRAM, and this is exactly the right move. If one skips that part, coalescing sounds like a memorized CUDA rule. Once one understands DRAM, coalescing becomes inevitable.

At the circuit level, a DRAM cell is just a capacitor plus a transistor. The capacitor stores a tiny amount of charge. The transistor, controlled by the wordline, connects the cell to a bitline. Reading the cell means connecting the tiny cell capacitor to the much larger bitline and sensing the resulting voltage perturbation. Because the stored charge is tiny, the signal is weak and must be amplified. Worse, the read is effectively destructive, so the value has to be restored after sensing.

This leads to the next architectural fact: DRAM is not optimized for reading isolated scalar words one by one. It is optimized around **rows**. An access activates a row, transfers it into the sense amplifiers, and then serves burst reads from that temporary row buffer. Once a row is open, nearby accesses are cheap. Opening another row is expensive because the system must precharge, activate, sense, and later restore that new row.

This is the key insight. The expensive operation in DRAM is not “fetch one float.” The expensive operation is “activate a new row.” The memory system therefore wants each row activation to serve as many useful nearby requests as possible.

The slide deck’s HBM section makes this concrete. A bank has a row buffer or page on the order of 1–2 KB, and burst transfers return chunks from the active row. Channels, stacks, and many banks operate in parallel to build up the aggregate HBM bandwidth. But all of that impressive bandwidth depends on the software giving the hardware addresses that let those row buffers and burst transfers be used efficiently.

This is why stride matters so much. The slides include a bandwidth experiment in which small stride gives about $1418\ \text{GB/s}$, while pathological stride—effectively consuming only one word out of a whole opened page—drops performance to around $111\ \text{GB/s}$, only 8% of peak in that test. The measured gap is on the order of 12–13×.
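
The slides report only the measurements, but a strided-read kernel of roughly the following shape is enough to reproduce the effect (a sketch under assumed parameters, not the lecture's benchmark): with `stride = 1` a warp's 32 reads fall into a few dense bursts from open rows, while a large stride touches only one useful word per opened page.

__global__ void strided_read(const float* __restrict__ in,
                             float* __restrict__ out,
                             size_t n, size_t stride) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read elements `stride` words apart; stride = 1 is fully coalesced.
    size_t idx = (i * stride) % n;
    out[i] = in[idx];
}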

Now the real meaning of coalescing becomes clear. Coalescing is not a stylistic programming preference. It is the software-side condition that lets the DRAM/HBM system satisfy a warp’s requests with a small number of dense burst transactions from open rows. If the warp’s requests are contiguous, each fetched chunk contains mostly useful data. If the warp’s requests are strided, far-apart, or scattered, the hardware opens rows and moves data that the kernel barely uses. That is why performance collapses.

This gives us the first diagnostic lens for the matrix-multiplication kernel: not “what does one thread do?” but “what addresses do 32 warp lanes request together?”


6. Warps, lockstep execution, and why thread-local locality is not enough

The next critical layer of the lecture is the warp execution model. CUDA programmers often write two-dimensional or three-dimensional thread blocks because the problem itself is two-dimensional or three-dimensional. But the hardware does not execute “a 2D block” directly. It executes **warps** of 32 threads, and the lanes in a warp are assigned in a linear order: first `threadIdx.x`, then `threadIdx.y`, then `threadIdx.z`. The slide on warp organization emphasizes exactly this rule.
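
Concretely, a thread's warp and lane are determined by its linear index within the block, with `threadIdx.x` varying fastest:

// Linear thread index within the block (x fastest, then y, then z).
const unsigned int linearTid = threadIdx.x
                             + threadIdx.y * blockDim.x
                             + threadIdx.z * blockDim.x * blockDim.y;
const unsigned int warpId = linearTid / 32;   // which warp within the block
const unsigned int laneId = linearTid % 32;   // which lane within that warp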

This matters because what the memory system sees is not the access pattern of one thread. It sees the combined access pattern of all 32 lanes in the warp, issued together in lockstep. A single thread may appear to have a perfectly reasonable access pattern and still belong to a warp whose collective access pattern is terrible.

That is exactly what happens in the naive matrix multiplication mapping.

If one looks only at a single thread computing $C_{ij}$, its accesses appear mixed but understandable. It walks along row $i$ of $A$, which is contiguous in row-major memory, and it walks down column $j$ of $B$, which is strided and therefore not local. At the single-thread level, $A$ looks good and $B$ looks bad.

But that is not the question the hardware asks.

Suppose adjacent lanes in the warp correspond to adjacent **rows** of $C$ while keeping the same column $j$. Then in one step of the inner $k$-loop, the 32 lanes read:

- $A[i, k], A[i+1, k], A[i+2, k], \dots$, which is a **column** of $A$ in row-major storage, hence highly strided;
- the same value $B[k, j]$, which can often be served by broadcast-like behavior.

This is the deep bug in the natural mapping. Thread-local reasoning says, “A is row-wise, so A should be fine.” Warp-level reasoning says, “At a fixed inner-loop iteration, the warp actually walks down a column of A.” That completely destroys global-memory locality. The slide sequence “A single thread,” “A warp of thread (Why is this bad?),” and “This is what we get” visualizes exactly this mismatch.

The important conceptual lesson is this:

**Thread-local locality is not enough.** For global memory, what matters is **warp-local locality**.


7. The first real optimization: align the warp with rows, not columns

Once the problem is stated at the warp level, the fix becomes surprisingly simple. Instead of letting the warp cover a column-wise strip of the output $C$, make the warp cover a **row-wise strip**.

Now consider one inner-loop iteration again. All lanes in the warp use the same $k$. If they share the same output row $i$ and cover adjacent output columns $j, j+1, \dots, j+31$, then:

- all lanes need the same scalar $A[i,k]$, which is naturally broadcast-friendly;
- all lanes need consecutive elements $B[k,j], B[k,j+1], \dots, B[k,j+31]$, which form a contiguous segment of row $k$ of $B$.

This is exactly what the memory system wants. One operand becomes broadcast-like, and the other becomes coalesced. The slide “A warp of thread (what we want)” shows precisely this improved warp-level access pattern.
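
The slides show the access pattern rather than the full code, but the remapping can be sketched as follows, assuming each block is launched as a one-dimensional array of `BLOCKSIZE * BLOCKSIZE` threads (`BLOCKSIZE` is an illustrative name): consecutive values of `threadIdx.x` now share a row of $C$ and span adjacent columns.

template <const int BLOCKSIZE>
__global__ void sgemm_coalesced(int M, int N, int K,
                                float alpha, const float* A, const float* B,
                                float beta, float* C) {
    // Adjacent lanes keep the same row x and take consecutive columns y, so per k-step
    // the warp broadcasts A[x*K+k] and reads a contiguous run of row k of B.
    const unsigned int x = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
    const unsigned int y = blockIdx.y * BLOCKSIZE + (threadIdx.x % BLOCKSIZE);

    if (x < M && y < N) {
        float tmp = 0.0f;
        for (int k = 0; k < K; ++k) {
            tmp += A[x * K + k] * B[k * N + y];
        }
        C[x * N + y] = alpha * tmp + beta * C[x * N + y];
    }
}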

The beautiful part of this optimization is how small the code change is relative to its effect. The arithmetic is unchanged. The loop structure is unchanged. The algorithm is unchanged. Only the mapping from thread coordinates to output coordinates changes. Yet the measured performance improves from about $0.5\ \text{TFLOP/s}$ to about $6.3\ \text{TFLOP/s}$, which is almost exactly the same factor as the 12–13× bandwidth loss observed in the pathological stride experiment. The slide deck explicitly states this new result and summarizes it as Lesson 1: data in global memory should be accessed contiguously for maximum bandwidth efficiency.

This is the first major pedagogical payoff of the lecture. A seemingly tiny index change generates a huge speedup because that tiny change encodes a large physical truth about HBM and warp execution. CUDA optimization often feels like “small code, giant effect” for exactly this reason.

It is also worth adding a useful interpretive point here. The algorithmic arithmetic intensity of GEMM did not change. What changed is the **realized arithmetic intensity** seen by HBM. Once more of each HBM transaction becomes useful, the kernel stops wasting bandwidth and starts moving toward the compute roof predicted by the paper model.


8. The next bottleneck: HBM is better, but data are still not shared

A 12× speedup is dramatic, but it is still far from H100 peak performance. So the next question is natural: where is the bottleneck now?

The slides answer with a simple observation: **data fetched by threads are not shared.**

This is the next major systems insight. Even after global-memory coalescing is fixed, the kernel still lets each thread fetch the data it needs from HBM into its own private registers. Registers are private. If neighboring threads need overlapping pieces of $A$ and $B$, that overlap is not exploited. The same values get loaded many times by different threads.

At the algorithmic level, GEMM should move $O(N^2)$ data and perform $O(N^3)$ arithmetic. But if every thread loads its own private copy of overlapping operands, the **effective** global-memory traffic can behave much closer to $O(N^3)$. This is exactly the type of communication-placement failure the opening slide warned about. The arithmetic has not changed, but the placement of reuse is completely wrong.

This is why the lecture next introduces shared memory.


9. Shared-memory tiling: using block-level cooperation to reduce HBM traffic

Shared memory is small, fast, on-chip storage that is visible to all threads in a block. The critical word is **shared**. Registers are private, so they cannot support inter-thread reuse. Shared memory can.

The idea of tiling is therefore conceptually straightforward. Suppose a block is assigned a tile of the output matrix $C$. To compute that output tile, all threads in the block need a corresponding tile from $A$ and a corresponding tile from $B$. Instead of letting every thread fetch the needed data independently from HBM, let the block load the tiles cooperatively into shared memory. Then every thread reuses those values from shared memory while accumulating partial results.

The slides break the process into explicit steps: determine the grid, block, and thread arrangement; determine which output tile of $C$ this block owns; determine the corresponding tile regions in $A$ and $B$; copy those tiles from global memory into shared memory; compute tile-wise partial results; repeat across the $K$ dimension; and finally write the result tile of $C$. The “Shared Memory — Step by step” slides list this sequence explicitly.

A simplified kernel skeleton looks like this:

// Shared-memory tiles for one BM x BK block of A and one BK x BN block of B.
// This sketch assumes a square configuration with BM == BK == BN == blockDim.x == blockDim.y.
__shared__ float As[BM][BK];
__shared__ float Bs[BK][BN];

float acc = 0.0f;

for (int tile = 0; tile < K; tile += BK) {
    // Each thread cooperatively loads one element of each tile from global memory.
    As[ty][tx] = A[row * K + (tile + tx)];
    Bs[ty][tx] = B[(tile + ty) * N + col];
    __syncthreads();   // both tiles must be fully staged before anyone reads them

    for (int k = 0; k < BK; ++k) {
        acc += As[ty][k] * Bs[k][tx];
    }
    __syncthreads();   // all reads must finish before the next tile overwrites the buffers
}

C[row * N + col] = acc;   // write this thread's completed output element

The exact indexing depends on tile sizes and block layout, but the architecture is the same: HBM to shared memory once per tile, then many arithmetic operations reusing the tile on-chip.

This changes the communication picture dramatically. Each value brought from HBM into shared memory is now used by multiple threads in the block. A tile of $A$ is reused across many output columns. A tile of $B$ is reused across many output rows. In roofline language, the arithmetic intensity seen by HBM increases because the same number of HBM bytes now supports more FMAs.

The lecture reports that shared-memory tiling improves performance from about $6.3\ \text{TFLOP/s}$ to about $9\ \text{TFLOP/s}$, and the associated slide describes this as “arithmetic intensity improves.” Lesson 2 on the slide states the core idea succinctly: tiling facilitates intra-block reuse by exploiting the shared memory of each streaming multiprocessor.

There is also a deeper machine-learning-systems lesson here. Shared memory is software-managed, unlike L1 cache. That means the programmer or compiler is responsible for deciding what should live there. This is a form of explicit data orchestration. Once workloads become complex, deciding which reuse should be expressed through software-managed memory and which reuse can be left to hardware cache policy becomes an important systems question.


10. Shared memory is faster than HBM, but it is not free

Once shared-memory tiling is added, the bottleneck shifts again. This is an excellent example of hierarchical performance debugging. Optimizing one level of the hierarchy reveals the next one.

The profiling slides show that the kernel remains memory-bound, but now the pressure is on shared memory rather than on HBM. The slide on warp stalls explains that the warp is stalled because the MIO (memory input/output) instruction queue is full, caused by heavy shared-memory instruction traffic within the block. The “Lots of memory accesses — Still memory bound” slides emphasize exactly this new diagnosis.

Why does this happen? Because in the basic tiled kernel, each thread still computes only one output element. For every inner-loop step, it must fetch one value from the $A$ tile and one value from the $B$ tile out of shared memory into registers, perform one multiply-add, and repeat. Shared memory is much faster than HBM, but if the kernel generates enormous amounts of shared-memory traffic, then the shared-memory pipeline itself becomes the limiter.

This is the second major performance-debugging lesson of the lecture. The bottleneck has not disappeared. It has moved inward in the hierarchy.

The fix now is not to reduce HBM traffic further. The fix is to reduce **shared-memory traffic per unit of arithmetic**. That leads to register tiling.


11. Register tiling: make each thread compute more than one output

The lecture frames the next idea very well: if we want fewer shared-memory instructions, then each thread must do more work.

This is the basic intuition behind **register tiling**. In the current tiled kernel, a thread computes only one output $C_{ij}$. That means a value fetched from shared memory is used once and then discarded. But registers are the fastest storage on the machine. If each thread were responsible for multiple outputs, then a value loaded from shared memory into a register could be reused across several FMAs before being discarded. Shared-memory traffic per FLOP would fall.

11.1 One-dimensional register tiling

In one-dimensional register tiling, each thread computes a short vector of outputs rather than a single scalar. For example, instead of computing one output element in a column, it may compute several vertically adjacent outputs or several horizontally adjacent outputs, depending on how the block and tile are organized. The thread holds several partial sums in a small array of registers:

float threadResults[TM] = {0.0f};

Now when the thread loads one operand fragment, it can reuse that fragment across several accumulators. The “1D Register tiling” slides illustrate exactly this arrangement. The shared-memory tile is further decomposed into smaller fragments that become thread-local register tiles. The thread then computes several outputs rather than just one.
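
A sketch of the resulting inner loop, reusing the tile names from the earlier skeleton (`threadRow` and `threadCol` are assumed per-thread coordinates within the block tile):

// One Bs value is fetched from shared memory and reused across all TM accumulators,
// so shared-memory reads per FMA drop from 2 toward (TM + 1) / TM.
for (int k = 0; k < BK; ++k) {
    const float bTmp = Bs[k][threadCol];
    for (int m = 0; m < TM; ++m) {
        threadResults[m] += As[threadRow * TM + m][k] * bTmp;
    }
}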

The performance impact is large. The lecture reports that with 1D register tiling, performance jumps from about $9\ \text{TFLOP/s}$ to almost $20\ \text{TFLOP/s}$, more than another 2× improvement. The corresponding slide says shared-memory bandwidth usage has been significantly improved.

11.2 Two-dimensional register tiling

One-dimensional register tiling is only the first step. Two-dimensional register tiling lets each thread compute a small patch of outputs rather than a vector. That means both the $A$-side and $B$-side operands can be reused more effectively within the thread.

At this point, each thread becomes a tiny matrix-multiplication microkernel. It loads a small fragment from $A$ and a small fragment from $B$, then performs a small outer product into a $T_M \times T_N$ tile of accumulators in registers. This is much closer to the structure of high-performance GEMM kernels in optimized libraries.
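
Under the same naming assumptions, the per-thread microkernel can be sketched as a small outer product; each pass over `k` now performs $T_M \times T_N$ FMAs while reading only $T_M + T_N$ values from shared memory.

float regA[TM];
float regB[TN];
float acc[TM][TN] = {{0.0f}};

for (int k = 0; k < BK; ++k) {
    // Stage one column fragment of the A tile and one row fragment of the B tile into registers.
    for (int m = 0; m < TM; ++m) regA[m] = As[threadRow * TM + m][k];
    for (int n = 0; n < TN; ++n) regB[n] = Bs[k][threadCol * TN + n];
    // Rank-1 update of the TM x TN accumulator tile: TM * TN FMAs per TM + TN shared reads.
    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n)
            acc[m][n] += regA[m] * regB[n];
}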

The lecture reports that 2D register tiling pushes the kernel further, to roughly $25.8\ \text{TFLOP/s}$. The slide explicitly says that shared-memory bandwidth usage is improved further with 2D tiling.

11.3 Why register tiling works

Register tiling works because it pushes reuse one level deeper into the hierarchy. Coalescing made HBM accesses efficient. Shared-memory tiling made HBM data reusable within a block. Register tiling makes shared-memory data reusable within a thread. The communication volume is not just reduced in an abstract sense; it is moved to cheaper and cheaper levels of the hierarchy.

There is, however, another tradeoff here. Register tiling increases **register pressure**. More per-thread work means more registers per thread, which may lower occupancy because fewer warps can reside simultaneously on the SM. This is one reason optimal GEMM tile sizes are not fixed universal numbers. They depend on the machine, the precision format, the available shared memory, and the target occupancy regime. The “best” configuration is always a compromise among reuse, resource footprint, and scheduler flexibility.

This is where CUDA programming starts to look very much like systems design rather than simple parallelization.


12. Memory access vectorization: moving more bytes per instruction

After coalescing, shared-memory tiling, and register tiling, the lecture briefly introduces **memory access vectorization**. This is another useful refinement.

Even if accesses are coalesced, loading one float at a time still creates more memory instructions than necessary. If alignment permits, we can load wider chunks such as `float2` or `float4`, reducing instruction count and improving payload per instruction.
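
A minimal sketch of such a load, assuming the address is 16-byte aligned:

// One 128-bit load replaces four 32-bit loads (the address must be 16-byte aligned).
const float4 a4 = reinterpret_cast<const float4*>(&A[row * K + k])[0];
// a4.x, a4.y, a4.z, a4.w now hold four consecutive elements of the row.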

The lecture also notes an important layout trick: one operand may be naturally row-oriented while the other is naturally column-oriented. During the copy from global memory into shared memory, one can rotate or transpose the staging layout so that later accesses from shared memory into registers become row-like and therefore more local and easier to vectorize. This is a very important systems insight. Shared memory is not just a faster copy of global memory. It can also act as a **layout transformation buffer**. The software uses the HBM-to-SMEM transfer not only to move data inward, but to reshape data into a better form for the next stage of the pipeline.

This is one of the less obvious but most useful ways to think about on-chip memories in optimized kernels. They are not only for storage. They are for reformatting data into a shape that the next level of the hierarchy prefers.


13. Warp tiling: make the warp itself a meaningful computational object

The lecture next introduces **warp tiling**, and this is another important conceptual upgrade.

So far, the discussion has focused on blocks and threads. But the hardware actually schedules **warps**. Therefore, a mature mapping should not just assign useful work to a block and to a thread; it should also assign useful work to a warp as a collective object.

The slide on warp tiling defines it explicitly: a warp collaboratively computes a sub-tile of the output matrix instead of each thread independently computing a single element. The purpose is to align computation with the warp execution unit, reduce unnecessary synchronization, exploit register reuse more effectively, and further raise arithmetic intensity.

This is a beautiful conceptual refinement because it closes the loop on the earlier “execution mismatch” idea. The hardware executes warps, not arbitrary two-dimensional thread abstractions. Therefore, good software should increasingly shape computation around warps as first-class entities.

Warp tiling naturally sits between block-level shared-memory tiling and thread-level register tiling. A block owns a large tile in shared memory. Within that block, each warp owns a smaller sub-tile. Within that warp, each thread owns an even smaller register tile. This three-level decomposition mirrors the actual execution and storage structure of the machine.
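
A hypothetical configuration (illustrative numbers, not the lecture's) shows how the three levels divide one another:

// Block tile, held in shared memory: each block owns a 128 x 128 patch of C.
constexpr int BM = 128, BN = 128, BK = 16;
// Warp tile: the block's 8 warps each own a 64 x 32 sub-patch (a 2 x 4 grid of warps).
constexpr int WM = 64, WN = 32;
// Thread (register) tile: each of the warp's 32 lanes owns an 8 x 8 micro-patch.
constexpr int TM = 8, TN = 8;   // 32 lanes x (8 x 8) = 64 x 32 = one warp tile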

But the lecture also warns that warp tiling introduces new complications. Once a warp accesses shared memory in more structured ways, those access patterns may cause **bank conflicts**. This is where the phrase “the devil is in the details” becomes entirely appropriate.


14. Shared-memory bank conflicts: why fast memory can still be slow

Shared memory is fast because it is banked. On H100, shared memory is effectively organized into 32 banks. In an ideal case, the 32 lanes of a warp each access a different bank, and the whole warp load can complete in one cycle. If several lanes map to the same bank, the accesses serialize. In the worst case, all 32 lanes may contend for the same bank, and what should have taken one cycle can take many cycles instead.

The slide on H100 shared memory summarizes the idea clearly: conflict-free access takes one cycle, broadcast also takes one cycle, but a $k$-way conflict costs $k$ cycles, so performance depends on bank mapping efficiency rather than merely on memory size.

This introduces a new kind of locality problem. Even if data are “near” each other in shared memory, they may still be slow to access if they line up badly with the bank mapping. This often happens when data are written in one orientation and read in another. A warp may write a tile row-wise but later read it column-wise. If the stride interacts badly with the bank mapping, the access pattern can become highly conflicted.

A classic remedy is **padding**. If a shared-memory tile has width 32 and a warp reads it column-wise, all lanes can collide on the same bank. Adding one extra padding element per row changes the stride from 32 to 33, breaking the pathological alignment. Another remedy is to change the layout during staging, as already discussed in the vectorization section, so that later accesses become naturally bank-friendly.
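
In code, the padding remedy is a single extra column in the tile declaration:

__shared__ float tile[32][32];        // column-wise reads: all 32 lanes hit the same bank (32-way conflict)
__shared__ float tilePadded[32][33];  // stride 33: the 32 rows of a column map to 32 different banks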

The lecture’s point is exactly right: the performance of shared memory is governed not just by how much of it we use, but by **how the warp maps onto its banks**. Once a kernel is sufficiently optimized, these fine-grained details become load-bearing.


15. A unifying view of the optimization ladder

At this stage, the lecture has introduced a long list of specific techniques: coalescing, shared-memory tiling, register tiling, vectorized access, warp tiling, and bank-conflict avoidance. It is tempting to memorize them as isolated tricks. That would miss the deeper structure.

These techniques form a coherent optimization ladder:

- **Global-memory coalescing** fixes the mismatch between row-major layout and warp lane ordering.
- **Shared-memory tiling** fixes the mismatch between algorithmic reuse and per-thread private registers.
- **Register tiling** fixes the mismatch between block-level reuse and per-thread arithmetic throughput.
- **Vectorization** fixes the mismatch between scalar instruction granularity and the memory system’s preferred transfer width.
- **Warp tiling** fixes the mismatch between logical block decomposition and the warp as the actual execution unit.
- **Bank-conflict management** fixes the mismatch between logical shared-memory locality and the physical bank structure of on-chip SRAM.

Seen from this angle, every optimization in the chapter is doing the same thing: it is removing one specific mismatch between the workload’s natural structure and the hardware’s actual structure.

That is why the opening frame of the lecture is so important. CUDA programming is not a bag of tricks. It is the disciplined process of mapping a high-dimensional workload onto linear and hierarchical hardware, and every performance problem can be traced back to a mismatch in that mapping.


16. Why we still do not match cuBLAS

By the end of the lecture, the optimized CUDA-core GEMM reaches roughly $25$–$26\ \text{TFLOP/s}$, which is a remarkable improvement over the naive $0.5\ \text{TFLOP/s}$. The total speedup is almost two orders of magnitude. The slide deck explicitly frames this as the cumulative result of putting all the pieces together: contiguous global-memory access, shared-memory reuse, warp- and register-level tiling, vectorization, and bank-aware shared-memory access.

And yet this still does not fully match the H100’s theoretical CUDA-core FP32 peak, much less what tensor-core paths can achieve. The slides jokingly summarize this as Lesson 4: it is hard to beat cuBLAS on its own turf.

Why is that true? There are several reasons.

First, production libraries such as cuBLAS do not merely use the same ideas we discussed; they use them with extraordinary maturity and tuning depth. They choose tile sizes carefully for each architecture and problem regime.

Second, they use **software pipelining** and often **double buffering**, so loading the next tile overlaps with computation on the current tile. On newer architectures, asynchronous copy instructions such as `cp.async` make this even more effective.

Third, they often target **tensor cores**, which operate under different instruction formats and deliver much higher peak throughput when the data type and tiling match the hardware.

Fourth, they exploit machine-specific scheduling, register blocking, instruction selection, and layout transforms at a depth that is very difficult to reproduce manually.

So the right educational interpretation of the lecture is not disappointment that the kernel is “still not cuBLAS.” The right interpretation is respect for why cuBLAS is hard to beat. Once one sees how many layers of mapping and reuse must be aligned simultaneously, the need for autotuning, compiler support, and vendor-optimized libraries becomes obvious.

This is also why the transcript ends by pointing toward the next lecture on ML compilers. Once one understands the complexity of kernel tuning, it becomes natural to ask whether at least part of this optimization ladder can be generated or searched automatically.


17. The real lessons of this chapter

We can now restate the deepest lessons of the lecture in a more book-like way.

The first lesson is that **theoretical arithmetic intensity is not the same as realized arithmetic intensity**. GEMM is mathematically compute-bound, but a poor mapping can make the implementation behave like a terrible memory kernel.

The second lesson is that **the unit of optimization is not the individual thread**. The real units are hierarchical: HBM pages, memory transactions, warps, thread blocks, shared-memory tiles, register tiles, and execution pipelines. Optimizing CUDA means understanding all of them.

The third lesson is that **reuse must be moved inward**. Reuse in HBM is too expensive. Reuse in shared memory is better. Reuse in registers is best. High-performance kernels are built by progressively pushing reuse down the hierarchy.

The fourth lesson is that **the geometry of the workload is not the geometry of execution**. A matrix is two-dimensional, but a warp is one-dimensional. Any thread mapping that ignores that fact is likely to waste bandwidth.

The fifth lesson is that **small code changes can encode very large hardware truths**. The first 12× speedup came from a tiny index remapping because that remapping fundamentally changed what a warp asked the memory system to do.

The sixth lesson is that **performance tuning is iterative bottleneck debugging**. First HBM was the problem. Then shared memory became the problem. Then shared-memory instruction pressure became the problem. Then bank mapping and transfer granularity mattered. There is no single trick. There is a sequence of revealed bottlenecks.

This is exactly why CUDA programming belongs in a machine-learning systems course. It teaches students that performance is not a property of algorithms alone. It is a property of algorithms as they are realized through a hardware-constrained execution environment.


18. Conclusion: CUDA programming is respectful mapping

The lecture’s central sentence can now be understood in its full depth:

**CUDA programming is the problem of mapping high-dimensional workloads onto linear and hierarchical hardware.**

That is not a slogan. It is the organizing principle behind every optimization in this chapter. Coalescing is what happens when a warp respects HBM’s burst-oriented organization. Shared-memory tiling is what happens when a block-level reuse opportunity is placed on chip. Register tiling is what happens when one realizes that shared memory itself can become the bottleneck. Warp tiling is what happens when one takes the hardware execution unit seriously. Bank-conflict avoidance is what happens when one respects the internal structure of the supposedly “fast” memory itself.

This is why CUDA programming is so beautiful intellectually. Every good optimization has a physical explanation. Every bad mapping has a physical cost. The software is not merely writing down arithmetic. It is shaping the arithmetic into a form the machine can honor efficiently.

That is also why the lecture is such a strong bridge to compilers. Once one sees how many details matter—layout, warp mapping, tiling, synchronization, vectorization, bank mapping, register pressure, occupancy—it becomes obvious that kernel generation and autotuning are not luxuries. They are almost inevitable. The search space is too rich to navigate by intuition alone at scale.

So the right final mental model is not “CUDA lets me launch many threads.” The right mental model is:

**CUDA lets me design a mapping between mathematical structure and machine structure.**

And the better that mapping respects the machine, the closer performance moves from theory toward reality.