memxlife

MLSYS Course Project

With the dawn of the Agent Era, the landscape of systems engineering is undergoing a fundamental paradigm shift. Traditionally, tasks such as operator optimization and high-performance GPU tuning were reserved for a small circle of elite experts who possessed deep, vertical knowledge spanning from high-level algorithms down to the intricacies of GPU hardware architecture. In the past, completing projects of this complexity at scale would have required a workforce of thousands.

Today, that barrier is dissolving. We are entering a time where a single individual, empowered by intelligent agents, can perform the work that once required entire departments. By designing systems that can perceive hardware behavior, reason through performance bottlenecks, and iteratively improve code, we can automate the “expert” layer of AI infrastructure. This is the core objective of this semester’s project: to move from manually writing code to designing autonomous systems that can write, evaluate, and optimize infrastructure themselves. The project covers GPU profiling, CUDA kernel auto-tuning, and automated LLM infra generation.


Phase 1: GPU Performance Analysis Guide: Identifying Bottlenecks with ncu

Deadline: 8 am, Apr. 21, 2026

NVIDIA Nsight Compute is a kernel-level profiler for CUDA, and ncu is its command-line interface. Unlike Nsight Systems (nsys), which shows the application-wide timeline, ncu examines the internal execution of each kernel to show how hardware resources are being consumed.

In this project, your Agent needs to analyze the metrics output by ncu to determine the bottleneck of an operator (e.g., Matrix Multiplication). Below are the core metric categories and their significance in performance tuning.

1.1 Core Overview Metrics

Before performing a detailed analysis, first use the Roofline Model to determine if the operator is Compute-Bound or Memory-Bound.
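The roofline test can be sketched as follows. This is a minimal illustration, assuming an idealized matrix multiplication in which each operand is moved to or from DRAM exactly once; the function names and the peak values passed in are hypothetical placeholders, not measurements of any particular GPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved


def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """Idealized n x n x n matmul: 2*n^3 FLOPs, three n x n matrices moved once."""
    flops = 2.0 * n ** 3
    bytes_moved = 3.0 * n ** 2 * bytes_per_elem
    return arithmetic_intensity(flops, bytes_moved)


def roofline_bound(intensity: float, peak_gflops: float, peak_gbps: float) -> str:
    """Compare the kernel's intensity with the ridge point of the roofline."""
    ridge = peak_gflops / peak_gbps  # FLOP/byte at which the two roofs meet
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

With placeholder peaks of 10000 GFLOP/s and 1000 GB/s, the ridge point is 10 FLOP/byte, so a small matmul (n = 16) falls on the memory side and a large one (n = 1024) on the compute side. A real kernel rarely reaches the idealized intensity, which is why the measured ncu percentages are still needed.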

1.2 Memory Hierarchy Metrics

If the operator is memory-intensive, you need to identify which layer of the storage hierarchy is the bottleneck.

1.3 Compute Unit Metrics

If the operator is compute-bound, you need to identify which specific compute units are active.

1.4 Occupancy and Scheduling

Sometimes hardware utilization is low because the kernel does not keep enough warps resident on each SM to “fill up” the GPU and hide latency.
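A simplified estimate of theoretical occupancy is sketched below. The per-SM limits used as defaults (64 warps, 65536 registers, 48 KB of shared memory, 32 blocks) are assumptions for illustration only; actual limits depend on the architecture, and the sketch ignores register and shared-memory allocation granularity.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, regs_per_sm=65536,
                          smem_per_sm=49152, max_blocks_per_sm=32, warp_size=32):
    """Return (occupancy in [0, 1], name of the limiting resource)."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    # Number of blocks each resource allows to be resident on one SM.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // regs_per_block,
        "shared_memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks_per_sm),
    }
    limiter = min(limits, key=limits.get)  # the tightest resource wins
    resident_warps = limits[limiter] * warps_per_block
    return resident_warps / max_warps_per_sm, limiter
```

Under these assumed limits, a 256-thread block using 128 registers per thread is limited by registers to 2 resident blocks, or 25% occupancy. The limiter name tells the Agent which resource to reduce first.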

1.5 Common Bottlenecks & Diagnostic Table

| Bottleneck Type | Key Metrics | Optimization Direction |
| --- | --- | --- |
| VRAM Bound | dram__throughput > 70% | Reduce memory access; increase data reuse; use Shared Memory. |
| Compute Bound | High sm__throughput, high tensor_op_hmma | Consider algorithmic improvements or reduce precision (e.g., FP32 to FP16/BF16). |
| Uncoalesced Access | l1tex__t_sectors_pipe_lsu_mem_global_op_ld too high | Check memory access patterns; ensure adjacent threads access adjacent addresses. |
| Warp Divergence | sm__sass_thread_inst_executed_per_inst_executed < 32 | Reduce if/else branches; ensure threads within a warp follow the same path. |
| Bank Conflict | l1tex__data_bank_conflicts_pipe_lsu.sum > 0 | Adjust Shared Memory indexing (e.g., using Padding). |
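To make the padding advice for bank conflicts concrete, the sketch below models Shared Memory as 32 banks of 4-byte words and counts how many distinct words a warp requests from the busiest bank. The function names are hypothetical, and the model is a simplification that treats every access as a single 4-byte word.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=32):
    """Largest number of distinct words that map to one bank (1 = conflict-free)."""
    banks = defaultdict(set)
    for idx in word_indices:
        # Threads reading the same word are served by a broadcast, so use a set.
        banks[idx % num_banks].add(idx)
    return max(len(words) for words in banks.values())


def column_access(row_width, col=0, warp_size=32):
    """Word indices touched when thread t reads tile[t][col] of a row-major tile."""
    return [t * row_width + col for t in range(warp_size)]
```

Reading a column of a 32-word-wide tile sends all 32 threads to the same bank, so `conflict_degree(column_access(32))` is 32. Padding each row to 33 words spreads the same accesses over 32 different banks and the degree drops to 1.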

1.6 Advice for Students: How to Let the Agent Use This Data?

  1. Step 1: Get the Roofline. Have the Agent read sm__throughput and gpu__compute_memory_throughput first.
  2. Step 2: Characterize.
    • If Memory % > Compute %, dive into dram and l1/l2 metrics.
    • If Compute % > Memory %, check if tensor_op is being triggered.
  3. Step 3: Look for Anomalies. Check if Occupancy is too low or if Bank Conflict exists.
  4. Step 4: Map back to Code. Relate these metrics to your CUDA kernel code (e.g., is the loop unrolling insufficient? Is a __shared__ array missing?).
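The first three steps can be sketched as a small decision function. This is an illustrative outline, not a prescribed implementation: the dictionary keys occupancy_pct and bank_conflicts, the 50% occupancy threshold, and the "balanced" fallback are assumptions introduced here.

```python
def diagnose(metrics, min_occupancy_pct=50.0):
    """Return the follow-up actions suggested by Steps 1 to 3."""
    compute = metrics["sm__throughput"]
    memory = metrics["gpu__compute_memory_throughput"]
    actions = []

    # Steps 1 and 2: compare the two top-level percentages.
    if memory > compute:
        actions.append("memory-bound: inspect dram and l1/l2 metrics")
    elif compute > memory:
        actions.append("compute-bound: check whether tensor_op is active")
    else:
        actions.append("balanced: inspect both memory and compute metrics")

    # Step 3: look for anomalies (threshold is a hypothetical default).
    if metrics.get("occupancy_pct", 100.0) < min_occupancy_pct:
        actions.append("low occupancy: review block size, registers, shared memory")
    if metrics.get("bank_conflicts", 0) > 0:
        actions.append("bank conflicts: adjust shared-memory indexing or add padding")
    return actions
```

Step 4 remains a reasoning task: the Agent takes the returned actions and maps each one to a concrete change in the kernel source.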

How to acquire these metrics via Command Line?

# Example: Collect the three top-level metrics (SM throughput, memory throughput, Tensor Core activity) for a Matrix Multiplication kernel
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,sm__pipe_tensor_op_hmma_cycle_active.avg.pct_of_peak_sustained_active ./your_executable
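An Agent will usually build this command and read the results programmatically. The sketch below assumes ncu is run with its --csv option and that the CSV output has "Kernel Name", "Metric Name", and "Metric Value" columns, with status lines starting with "==" mixed into the output; verify these assumptions against the ncu version installed in your environment. The helper names are hypothetical.

```python
import csv
import io

METRICS = [
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed",
    "sm__pipe_tensor_op_hmma_cycle_active.avg.pct_of_peak_sustained_active",
]


def build_ncu_command(executable, metrics=METRICS):
    """Argument list suitable for subprocess.run (no shell quoting needed)."""
    return ["ncu", "--csv", "--metrics", ",".join(metrics), executable]


def parse_ncu_csv(text):
    """Return {kernel name: {metric name: value}} from CSV text (assumed layout)."""
    # Drop blank lines and profiler status lines such as "==PROF== ...".
    rows = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("==")]
    result = {}
    for row in csv.DictReader(io.StringIO("\n".join(rows))):
        raw = (row.get("Metric Value") or "").replace(",", "")
        try:
            value = float(raw)
        except ValueError:
            continue  # skip non-numeric or missing values
        kernel = row.get("Kernel Name") or "unknown"
        result.setdefault(kernel, {})[row.get("Metric Name")] = value
    return result
```

A typical call would pass `build_ncu_command("./your_executable")` to `subprocess.run(..., capture_output=True, text=True)` and feed the captured standard output to `parse_ncu_csv`.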

1.7 Hardware Intrinsic Profiling: Probing the GPU “DNA”

In this advanced phase of the project, your Agent is no longer just a passive reporter of existing kernel performance. It must act as a Hardware Probe. The goal is to “reverse-engineer” the physical characteristics and architectural limits of the underlying GPU by autonomously generating and analyzing micro-benchmarks.

1.7.1 Probing Objectives (Target Metrics)

Your Agent will be given a task to identify the following hardware-intrinsic parameters:

  1. Memory Latency Hierarchy: Measure the access latency, in cycles, for L1 Cache, L2 Cache, and DRAM. This requires the Agent to generate “Pointer Chasing” kernels, in which each load depends on the result of the previous one, so that hardware prefetchers cannot hide the latency.
  2. Effective Peak Bandwidth: Determine the maximum achievable throughput for Shared Memory and Global Memory (VRAM) under current conditions.
  3. L2 Cache Capacity: Identify the “cliff” in the latency-vs-size curve to pinpoint the exact physical size of the L2 cache.
  4. Actual Boost Frequency: Report the stable core clock frequency (MHz) while the GPU is under sustained compute load.
  5. Resource Penalties: Quantify the latency cost of a bank conflict in Shared Memory compared to a conflict-free access.
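Two host-side pieces of the latency and cache-capacity probes can be sketched as follows. The first builds the index array for a pointer-chasing kernel using Sattolo's algorithm, which guarantees a single cycle through every element, so a loop of the form idx = next[idx] visits the whole working set in an unpredictable order. The second brackets the capacity "cliff" from a list of measured latencies. The function names are hypothetical, and locating the cliff at the largest latency ratio is an assumed heuristic.

```python
import random


def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: a random permutation consisting of one n-element cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is strictly less than i
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def cycle_length(nxt, start=0):
    """Follow idx = nxt[idx] from start and count the steps needed to return."""
    idx, steps = nxt[start], 1
    while idx != start:
        idx = nxt[idx]
        steps += 1
    return steps


def find_cliff(sizes_bytes, latencies):
    """Return the (below, above) working-set sizes around the largest latency jump."""
    ratios = [latencies[i + 1] / latencies[i] for i in range(len(latencies) - 1)]
    k = max(range(len(ratios)), key=ratios.__getitem__)
    return sizes_bytes[k], sizes_bytes[k + 1]
```

The permutation is copied to device memory, and the kernel times the dependent loads (for example with clock64()) for each working-set size. The capacity estimate lies between the two sizes returned by find_cliff, so sweeping sizes more finely inside that bracket tightens the answer.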

1.7.2 Submission & Evaluation Workflow

1.7.3 Anti-Hacking & Environment Variations

To ensure your Agent performs real hardware analysis rather than a simple “table lookup” (e.g., searching for “A100 specs”), the evaluation environment will be dynamically altered:

  1. Non-Standard Frequency Locking: The GPU core and memory clocks may be locked at arbitrary, non-standard frequencies (e.g., 825 MHz instead of 1410 MHz) using nvidia-smi. A static lookup table will provide incorrect bandwidth/GFLOPS results.
  2. Resource Masking (SM Limiting): The system may restrict the kernel execution to a subset of SMs or limit the available memory per block via CUDA environment variables.
  3. Instruction Set Restrictions: Standard API calls like cudaGetDeviceProperties may be intercepted or report virtualized/misleading data.
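The effect of frequency locking on a lookup-based answer can be illustrated with a short calculation. The sketch assumes the Agent has already measured an elapsed SM cycle count (for example from clock64()) and the matching wall-clock time for a sustained compute kernel; the function names and the datasheet peak value are hypothetical.

```python
def estimate_clock_mhz(elapsed_cycles, elapsed_ms):
    """Core clock in MHz from a cycle count and the wall-clock time it spanned."""
    cycles_per_ms = elapsed_cycles / elapsed_ms
    return cycles_per_ms / 1e3  # 1 MHz = 1000 cycles per millisecond


def scaled_peak(spec_peak, spec_clock_mhz, measured_clock_mhz):
    """Scale a datasheet compute peak linearly to the measured core clock."""
    return spec_peak * measured_clock_mhz / spec_clock_mhz
```

If 825,000,000 cycles elapse in 1000 ms, the core clock is 825 MHz. A datasheet peak quoted at 1410 MHz then shrinks to about 58.5% of its nominal value, which is why a static table gives the wrong GFLOPS figure under a frequency lock.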

Recommendation: Your Agent should adopt a Multi-Strategy Fusion approach—combining low-level micro-benchmarking (writing small C++/CUDA probes), binary execution, and ncu metric analysis to cross-verify its findings.
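One simple way to cross-verify is to collect an estimate of the same quantity from each strategy and flag any source that disagrees with the consensus. The sketch below uses the median as the consensus and a 10% relative tolerance; both choices, and the source labels in the usage note, are assumptions for illustration.

```python
from statistics import median


def cross_verify(estimates, rel_tol=0.10):
    """Return (consensus value, sorted names of sources that deviate from it)."""
    center = median(estimates.values())
    outliers = sorted(
        name for name, value in estimates.items()
        if abs(value - center) > rel_tol * abs(center)
    )
    return center, outliers
```

For example, given clock estimates of 825 MHz from a micro-benchmark, 830 MHz from ncu, and 1410 MHz from an API query, the function returns a consensus of 830 MHz and flags the API query as the outlier, which is the expected outcome when device properties are intercepted.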

1.7.4 LLM-Based Evaluation and Scoring

To ensure a holistic assessment that covers both numerical precision and engineering reasoning, this project employs an LLM-as-a-Judge framework for grading.

The Grading Process: Upon submission, the evaluation system will feed the following three components into a high-capability Large Language Model (e.g., Gemini 3.1 Pro):

  1. Student Agent Output: The final results.json containing the identified hardware metrics and the reasoning/logs provided by your Agent.
  2. Ground Truth Data: The exact hardware parameters measured by our reference benchmarks under the specific environment (including any active frequency locks or resource masking).
  3. Experimental Evidence: Summaries of the ncu traces and micro-benchmarks generated during the Agent’s execution.

Scoring Rubric (100 Points Total):