With the dawn of the Agent Era, the landscape of systems engineering is undergoing a fundamental paradigm shift. Traditionally, tasks such as operator optimization and high-performance GPU tuning were reserved for a small circle of elite experts who possessed deep, vertical knowledge spanning from high-level algorithms down to the intricacies of GPU hardware architecture. In the past, completing projects of this complexity at scale would have required a workforce of thousands.
Today, that barrier is dissolving. We are entering a time where a single individual, empowered by intelligent agents, can perform the work that once required entire departments. By designing systems that can perceive hardware behavior, reason through performance bottlenecks, and iteratively improve code, we can automate the “expert” layer of AI infrastructure. This is the core objective of this semester’s project: to move from manually writing code to designing autonomous systems that can write, evaluate, and optimize infrastructure themselves—covering GPU profiling, CUDA kernel auto-tuning, and automated LLM infra generation.
**Deadline:** 8 AM, Apr. 21, 2026
NVIDIA Nsight Compute (ncu) is an interactive, kernel-level profiling tool for CUDA. Unlike Nsight Systems (nsys), which observes the global timeline, ncu dives deep into the internal execution of each kernel to show how hardware resources are being consumed.
In this project, your Agent needs to analyze the metrics output by ncu to determine the bottleneck of an operator (e.g., Matrix Multiplication). Below are the core metric categories and their significance in performance tuning.
Before performing a detailed analysis, first use the Roofline Model to determine whether the operator is Compute-Bound or Memory-Bound. ncu reports each of the following as the percentage of the hardware's theoretical peak that the current operator has reached:

- `sm__throughput.avg.pct_of_peak_sustained_elapsed` (Compute Utilization)
- `gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed` (Memory Throughput)
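As a first triage step, comparing these two percentages is often enough. Below is a minimal sketch of that heuristic; the 60% cutoff and the function names are illustrative assumptions, not ncu-defined constants:

```cpp
// Minimal sketch: Roofline-style triage from the two ncu percentages.
// The 60% cutoff is an illustrative rule of thumb, not an ncu constant.
#include <cstdio>

const char* classify(double compute_pct, double memory_pct) {
    // Neither pipe near peak: the kernel is likely latency/occupancy limited.
    if (compute_pct < 60.0 && memory_pct < 60.0) return "Latency-Bound";
    return (compute_pct >= memory_pct) ? "Compute-Bound" : "Memory-Bound";
}

int main() {
    // e.g., values parsed from an ncu report of a matmul kernel
    printf("%s\n", classify(85.3, 42.1));
    return 0;
}
```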
If the operator is memory-intensive, you need to identify which layer of the storage hierarchy is the bottleneck (a working-set probe is sketched after the list):
- `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum` (L1 Cache)
- `l2__throughput.avg.pct_of_peak_sustained_elapsed` (L2 Cache)
- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (VRAM/DRAM)
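One way to localize the bottleneck layer, independent of ncu, is a working-set sweep: streaming bandwidth typically drops once the buffer overflows the L2 cache, implicating DRAM. A minimal sketch, where the kernel name, grid shape, and sweep range are illustrative assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride read kernel; the conditional store keeps the loads from being
// optimized away without ever actually writing (acc never equals -1 here).
__global__ void read_sum(const float* __restrict__ in, float* out, size_t n) {
    float acc = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        acc += in[i];
    if (acc == -1.0f) *out = acc;
}

int main() {
    float* d_out;
    cudaMalloc(&d_out, sizeof(float));
    // Sweep working sets from 1 MiB to 256 MiB; watch where bandwidth drops.
    for (size_t mib = 1; mib <= 256; mib *= 2) {
        size_t n = mib * 1024 * 1024 / sizeof(float);
        float* d_in;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));
        read_sum<<<1024, 256>>>(d_in, d_out, n);   // warm-up launch

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        read_sum<<<1024, 256>>>(d_in, d_out, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%4zu MiB: %7.1f GiB/s\n", mib, (mib / 1024.0) / (ms / 1000.0));
        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(d_in);
    }
    cudaFree(d_out);
    return 0;
}
```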
If the operator is compute-bound, you need to identify which specific compute units are active (a micro-kernel probe is sketched after the list):
- `sm__pipe_tensor_op_hmma_cycle_active.avg.pct_of_peak_sustained_active` (Tensor Cores)
- `sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_active` (FP32/FMA)
- `sm__sass_thread_inst_executed_op_fp32_pred_on.sum` (Executed FP32 Instruction Count)
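To see these pipe metrics react, you can profile a tiny micro-kernel that issues nothing but dependent FMAs; under ncu it should show high `sm__pipe_fma` activity and near-zero `hmma` (Tensor Core) activity. A sketch, with the kernel name, iteration count, and launch shape as illustrative assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dependent-FMA micro-kernel: each iteration depends on the previous result,
// so the compiler cannot collapse the loop.
__global__ void fma_probe(float* out, int iters) {
    float a = threadIdx.x * 1e-3f;
    const float b = 1.000001f, c = 0.5f;
    for (int i = 0; i < iters; ++i) {
        a = fmaf(a, b, c);  // maps to a single FFMA instruction
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // keep the chain live
}

int main() {
    float* d_out;
    cudaMalloc(&d_out, 128 * 256 * sizeof(float));
    fma_probe<<<128, 256>>>(d_out, 1 << 20);
    cudaDeviceSynchronize();
    printf("done; run under ncu to inspect the pipe metrics\n");
    cudaFree(d_out);
    return 0;
}
```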
Sometimes hardware utilization is low because threads are failing to “fill up” the GPU (an occupancy query is sketched after the list):
- `sm__maximum_warps_per_active_cycle_pct` (Theoretical Occupancy)
- `sm__warps_active.avg.pct_of_peak_sustained_active` (Achieved Occupancy)
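Theoretical occupancy can also be computed directly with the CUDA runtime's occupancy API, which is handy for cross-checking ncu's number. A minimal sketch (`dummy_kernel` and the block size are illustrative assumptions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    int blockSize = 256, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // How many blocks of this kernel can be resident on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, dummy_kernel,
                                                  blockSize, /*dynamicSmem=*/0);
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```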
| Bottleneck Type | Key Metrics | Optimization Direction |
|---|---|---|
| VRAM Bound | `dram__throughput` > 70% | Reduce memory access; increase data reuse; use Shared Memory. |
| Compute Bound | High `sm__throughput`, high `tensor_op_hmma` | Consider algorithmic improvements or reduced precision (e.g., FP32 to FP16/BF16). |
| Uncoalesced Access | `l1tex__t_sectors_pipe_lsu_mem_global_op_ld` too high | Check memory access patterns; ensure adjacent threads access adjacent addresses. |
| Warp Divergence | `sm__sass_thread_inst_executed_per_inst_executed` < 32 | Reduce if/else branches; ensure threads within a warp follow the same path. |
| Bank Conflict | `l1tex__data_bank_conflicts_pipe_lsu.sum` > 0 | Adjust Shared Memory indexing (e.g., using padding). |
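As a concrete illustration of the last three rows, the classic tiled-transpose kernel below combines coalesced global accesses with a padded shared-memory tile; the `TILE` size and kernel name are illustrative assumptions:

```cuda
#define TILE 32

// Tiled matrix transpose; assumes n is a multiple of TILE.
// Launch with dim3 grid(n / TILE, n / TILE), block(TILE, TILE).
__global__ void transpose_padded(float* out, const float* in, int n) {
    // The "+ 1" padding shifts each row by one bank, so reading a column of
    // the tile no longer maps all 32 threads of a warp to the same bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load:
                                                      // adjacent threads,
                                                      // adjacent addresses
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;              // swap block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store,
                                                      // conflict-free tile read
}
```

Removing the `+ 1` padding should make the `l1tex__data_bank_conflicts_pipe_lsu.sum` counter jump above zero, which is a quick way to validate your Agent's detection logic.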
A suggested analysis workflow:

1. Look at `sm__throughput` and `gpu__compute_memory_throughput` first.
2. If memory-bound, drill down into the `dram` and `l1`/`l2` metrics.
3. If compute-bound, confirm that `tensor_op` is being triggered.
4. Check whether Occupancy is too low or if a Bank Conflict exists (e.g., padding on a `__shared__` array missing?).

```bash
# Example: Get all detailed metrics for a Matrix Multiplication kernel
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,sm__pipe_tensor_op_hmma_cycle_active.avg.pct_of_peak_sustained_active ./your_executable
```
In this advanced phase of the project, your Agent is no longer just a passive reporter of existing kernel performance. It must act as a Hardware Probe. The goal is to “reverse-engineer” the physical characteristics and architectural limits of the underlying GPU by autonomously generating and analyzing micro-benchmarks.
Your Agent will be given a task to identify the following hardware-intrinsic parameters:
**Input:** `target_spec.json` containing the list of hardware metrics to be identified.

```json
{"targets": ["dram_latency_cycles", "max_shmem_per_block_kb", "actual_boost_clock_mhz"]}
```

**Output:** `results.json` containing the identified numeric values.

```json
{"dram_latency_cycles": 442, "max_shmem_per_block_kb": 48, ...}
```

To ensure your Agent is performing real hardware analysis rather than a simple “table lookup” (e.g., searching for “A100 specs”), the evaluation environment will be dynamically altered:
- GPU clocks may be altered (e.g., via `nvidia-smi`), so a static lookup table will provide incorrect bandwidth/GFLOPS results.
- `cudaGetDeviceProperties` may be intercepted or report virtualized/misleading data.

**Recommendation:** Your Agent should adopt a Multi-Strategy Fusion approach, combining low-level micro-benchmarking (writing small C++/CUDA probes), binary execution, and ncu metric analysis to cross-verify its findings. One such probe is sketched below.
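As an example of such a probe, DRAM load latency can be estimated with a single-threaded pointer chase over a buffer too large, and too randomly ordered, for the caches. A minimal sketch; the buffer size, step count, and RNG seed are illustrative assumptions, and the result is an average that includes loop overhead:

```cuda
#include <cstdio>
#include <random>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Single-threaded pointer chase: each load depends on the previous one, so
// elapsed cycles / steps approximates the per-load latency.
__global__ void chase(const unsigned int* __restrict__ buf,
                      int steps, long long* out_cycles, unsigned int* sink) {
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i) {
        idx = buf[idx];
    }
    long long stop = clock64();
    *out_cycles = stop - start;
    *sink = idx;  // keep the chain from being optimized away
}

int main() {
    const unsigned int N = 16u * 1024 * 1024;  // 64 MiB of uint32 > typical L2
    std::vector<unsigned int> h(N);
    for (unsigned int i = 0; i < N; ++i) h[i] = i;

    // Sattolo's algorithm: a single random cycle, so the chase never revisits
    // an element early and caching/prefetching is defeated.
    std::mt19937 rng(42);
    for (unsigned int i = N - 1; i > 0; --i) {
        unsigned int j = std::uniform_int_distribution<unsigned int>(0, i - 1)(rng);
        std::swap(h[i], h[j]);
    }

    unsigned int *d_buf, *d_sink;
    long long* d_cycles;
    cudaMalloc(&d_buf, N * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_buf, h.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice);

    const int STEPS = 100000;
    chase<<<1, 1>>>(d_buf, STEPS, d_cycles, d_sink);  // one thread: pure latency

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("approx. load latency: %lld cycles\n", cycles / STEPS);
    return 0;
}
```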
To ensure a holistic assessment that covers both numerical precision and engineering reasoning, this project employs an LLM-as-a-Judge framework for grading.
The Grading Process: Upon submission, the evaluation system will feed the following three components into a high-capability Large Language Model (e.g., Gemini 3.1 Pro):
1. The `results.json` containing the identified hardware metrics.
2. The reasoning/logs provided by your Agent.
3. The `ncu` traces and micro-benchmarks generated during the Agent's execution.

**Scoring Rubric (100 Points Total):**