Cohort 1 starts July 1, 2026 — Limited seats

The GPU Engineering Course for the AI Era

From GPU architecture to CUDA kernels to distributed training at scale. Six hands-on modules built for engineers who need to operate at the hardware layer.

6 Modules

30+ Lab exercises

100% Hands-on

Limited Seats available

Six-module path · July 1, 2026

01 GPU Systems Fundamentals: Architecture · CUDA ecosystem · Docker
02 PyTorch Performance: Profiling · mixed precision · memory
03 CUDA for Engineers: Kernels · Numba · pitfalls
04 Distributed Training: DDP · NCCL · NVLink · parallelism
05 GPU Operations: Slurm · monitoring · health checks
06 Capstone Project: Optimize + GPU performance report

Who it's for

Built for people who work close to the metal.

ML Engineers

Training models that are hitting memory walls or speed limits. You need to go below the framework and understand what's actually happening on the hardware.

Platform & DevOps Engineers

Managing GPU clusters, Kubernetes nodes, or bare-metal servers for AI teams. You want to monitor, debug, and operate GPU workloads with confidence.

CS Students & Researchers

Learning GPU programming for research or a career in AI infrastructure. You want practical experience, not just theory — with labs that mirror real production environments.

Full Curriculum

Six modules. Zero fluff.
Built for practitioners.

Foundation ~4 hrs

GPU Systems Fundamentals

Understand every layer of the NVIDIA software stack — from silicon to container. The mental model that makes everything else click.

GPU Architecture basics: SMs, warps, thread blocks, memory hierarchy

CUDA ecosystem: How the NVIDIA driver, CUDA toolkit, and cuDNN fit together

Docker + NVIDIA Container Toolkit: Containerized GPU environments from scratch

GPU Architecture · CUDA Ecosystem · NVIDIA Driver · CUDA Toolkit · cuDNN · Docker · NVIDIA Container Toolkit

Performance ~5 hrs

PyTorch Performance

Move from "it runs" to "I know why it's slow." Profile memory, fix bottlenecks, and squeeze real throughput from your training loop.

Tensors & the training loop: GPU tensor lifecycle, grad computation, data loading

Mixed precision: FP16/BF16, autocast, gradient scaling

Profiling: PyTorch Profiler, Nsight, reading traces

Tensors · GPU Training Loop · Mixed Precision · Memory Usage · Profiling

Engineering ~6 hrs

CUDA for Engineers

Write your first CUDA kernels in Python. Understand memory transfer, shared memory, and the performance traps that trip up real engineers.

Kernel programming: Thread indexing, parallelism, occupancy

Memory transfer: Host↔device, pinned memory, PCIe bandwidth

Numba / CUDA Python: Write real kernels without leaving Python

CUDA Kernels · Memory Transfer · Numba · CUDA Python · Performance Pitfalls

Scale ~5 hrs

Distributed Training

Train across multiple GPUs and nodes. Understand the communication layer — NCCL, NVLink, NVSwitch — and why your multi-GPU job is stalling.

DDP & FSDP: PyTorch distributed, gradient sync, bucket sizing

NCCL internals: AllReduce, ring topology, NCCL tuning

NVLink / NVSwitch: GPU-to-GPU interconnect, bandwidth vs PCIe

DDP · NCCL · NVLink · NVSwitch · Data Parallelism · Model Parallelism

Operations ~4 hrs

GPU Operations

Operate shared GPU environments like a pro. Schedule jobs, monitor utilization, detect failures before they cascade, and read logs that actually tell you something.

Slurm scheduler: sbatch, squeue, GPU partitions, preemption

GPU monitoring: nvidia-smi, DCGM, utilization vs saturation

Failure detection: XID errors, ECC, health checks, alert patterns

Slurm · Job Scheduling · GPU Monitoring · Failure Detection · Health Checks

Capstone Full workload

Optimize a Real Training Workload

Apply everything. Take a real training job, profile it end-to-end, identify and fix bottlenecks, and produce a GPU performance report that explains the tradeoffs.

Optimize throughput: Apply mixed precision, profiling, and kernel-level fixes to a real model

GPU performance report: Document findings, show evidence, make data-driven recommendations

Portfolio artifact: A completed report you can show in interviews or use internally

Course Format

Live cohorts. Real labs. GPU access included.

Starts July 1, 2026 · Pleasanton, CA · Limited seats

The first cohort will be live instructor-led sessions with lab time after each module. We are measuring demand to finalize schedule, depth, and pricing.

Students get access to GPU lab environments — no cloud account required for labs.

Reserve your spot

Weekend live cohort: Saturday sessions, recorded for async replay

~24 hours of content: Spread across 6 modules with lab time

GPU lab access: Hands-on environments provided, no setup required

Certificate of completion: With capstone performance report artifact

We are calibrating based on:

Schedule preference (weekend vs weeknight)
Current experience level and role
Interest in team / corporate seats
Budget signals for first cohort pricing

Join the Waitlist

Shape the first NullHz cohort.

Submit your interest below. Your responses directly inform how the first cohort is structured — schedule, depth, pricing, and lab environment.

Early access pricing before public launch

Priority enrollment — seats are limited

In-person in Pleasanton, CA with GPU labs

Full name
Email
Current role
GPU / CUDA experience
Preferred format
What brings you here? What do you want to solve?