Cohort 1 starts July 1, 2026 — Limited seats

The GPU Engineering Course for the AI Era

From GPU architecture to CUDA kernels to distributed training at scale. Six hands-on modules built for engineers who need to operate at the hardware layer.

6 Modules
30+ Lab exercises
100% Hands-on
Limited Seats available

Who it's for

Built for people who work close to the metal.

ML Engineers

Training models that are hitting memory walls or speed limits. You need to go below the framework and understand what's actually happening on the hardware.

Platform & DevOps Engineers

Managing GPU clusters, Kubernetes nodes, or bare-metal servers for AI teams. You want to monitor, debug, and operate GPU workloads with confidence.

CS Students & Researchers

Learning GPU programming for research or a career in AI infrastructure. You want practical experience, not just theory — with labs that mirror real production environments.

Full Curriculum

Six modules. Zero fluff. Built for practitioners.

Module 01: Foundation (~4 hrs)

GPU Systems Fundamentals

Understand every layer of the NVIDIA stack, from silicon to container. The mental model that makes everything else click.

  • GPU architecture basics: SMs, warps, thread blocks, memory hierarchy
  • CUDA ecosystem: how the NVIDIA driver, CUDA toolkit, and cuDNN fit together
  • Docker + NVIDIA Container Toolkit: containerized GPU environments from scratch
GPU Architecture, CUDA Ecosystem, NVIDIA Driver, CUDA Toolkit, cuDNN, Docker, NVIDIA Container Toolkit
Module 02: Performance (~5 hrs)

PyTorch Performance

Move from "it runs" to "I know why it's slow." Profile memory, fix bottlenecks, and squeeze real throughput from your training loop.

  • Tensors & the training loop: GPU tensor lifecycle, grad computation, data loading
  • Mixed precision: FP16/BF16, autocast, gradient scaling
  • Profiling: PyTorch Profiler, Nsight, reading traces
Tensors, GPU Training Loop, Mixed Precision, Memory Usage, Profiling
Module 03: Engineering (~6 hrs)

CUDA for Engineers

Write your first CUDA kernels in Python. Understand memory transfer, shared memory, and the performance traps that trip up real engineers.

  • Kernel programming: thread indexing, parallelism, occupancy
  • Memory transfer: host↔device, pinned memory, PCIe bandwidth
  • Numba / CUDA Python: write real kernels without leaving Python
CUDA Kernels, Memory Transfer, Numba, CUDA Python, Performance Pitfalls
Module 04: Scale (~5 hrs)

Distributed Training

Train across multiple GPUs and nodes. Understand the communication layer — NCCL, NVLink, NVSwitch — and why your multi-GPU job is stalling.

  • DDP & FSDP: PyTorch distributed, gradient sync, bucket sizing
  • NCCL internals: AllReduce, ring topology, NCCL tuning
  • NVLink / NVSwitch: GPU-to-GPU interconnect, bandwidth vs PCIe
DDP, NCCL, NVLink, NVSwitch, Data Parallelism, Model Parallelism
Module 05: Operations (~4 hrs)

GPU Operations

Operate shared GPU environments like a pro. Schedule jobs, monitor utilization, detect failures before they cascade, and read logs that actually tell you something.

  • Slurm scheduler: sbatch, squeue, GPU partitions, preemption
  • GPU monitoring: nvidia-smi, DCGM, utilization vs saturation
  • Failure detection: XID errors, ECC, health checks, alert patterns
Slurm, Job Scheduling, GPU Monitoring, Failure Detection, Health Checks
Module 06: Capstone (full workload)

Optimize a Real Training Workload

Apply everything. Take a real training job, profile it end-to-end, identify and fix bottlenecks, and produce a GPU performance report that explains the tradeoffs.

Optimize throughput

Apply mixed precision, profiling, and kernel-level fixes to a real model

GPU performance report

Document findings, show evidence, make data-driven recommendations

Portfolio artifact

A completed report you can show in interviews or use internally

Course Format

Live cohorts. Real labs. GPU access included.

Starts July 1, 2026 | Pleasanton, CA | Limited seats

The first cohort will be live instructor-led sessions with lab time after each module. We are measuring demand to finalize schedule, depth, and pricing.

Students get access to GPU lab environments — no cloud account required for labs.

  • Weekend live cohort: Saturday sessions, recorded for async replay
  • ~24 hours of content: spread across 6 modules with lab time
  • GPU lab access: hands-on environments provided, no setup required
  • Certificate of completion: with capstone performance report artifact

We are calibrating based on:

  • Schedule preference (weekend vs weeknight)
  • Current experience level and role
  • Interest in team / corporate seats
  • Budget signals for first cohort pricing

Join the Waitlist

Shape the first NullHz cohort.

Submit your interest below. Your responses directly inform how the first cohort is structured — schedule, depth, pricing, and lab environment.

Early access pricing before public launch
Priority enrollment — seats are limited
In-person in Pleasanton, CA with GPU labs