As generative models move from experimentation into production, performance, reliability, and control become as important as model quality. Generative AI with C++ looks beneath the abstractions to show how modern transformers and large language models actually run: how they move data, consume memory, and behave under real-world constraints.
Written from a systems-engineering perspective, this book explains generative AI as an execution problem, not an API call. It connects model architecture to hardware behavior, revealing why inference latency grows, where optimizations truly pay off, and how C++ enables predictable, long-lived AI systems at scale.
This is a practical guide for engineers who want to understand generative AI from the inside out. Readers will gain:
- A clear understanding of how transformers and LLMs execute at runtime
- Practical insight into inference latency, memory behavior, and decoding cost
- A grounded explanation of quantization, pruning, mixed precision, and caching
- Real-world optimization techniques that scale beyond benchmarks
- The ability to reason about performance trade-offs instead of guessing
- A systems-level perspective that connects models, compilers, and hardware
Three themes recur throughout:

- Performance issues rarely come from the model alone; they come from execution
- C++ remains the language of choice for controlled, high-performance AI systems
- Understanding how models run is now as important as understanding what they do