vLLM in Production: Running LLMs at Scale with GPUs, High-Performance Inference & Modern AI Infrastructure (Paperback)

ISBN: B0GK1LN7VZ

ISBN13: 9798245694542


LLM inference is no longer experimental; it is production infrastructure.
As models grow larger, applications become agent-driven, and real users arrive, the true bottleneck shifts from training to serving models reliably, securely, and at scale.

vLLM in Production is a hands-on, operator-first guide to running large language models in real environments, where GPUs are finite, latency matters, failures happen, and cost must be controlled.

This book is not about prompt engineering or theoretical AI. It is about engineering discipline: how inference actually behaves under load, why naive deployments collapse, and how to design systems that remain stable when traffic, context length, and concurrency collide.

You will learn how vLLM achieves high throughput through architectural choices like PagedAttention and continuous batching, and, more importantly, how to deploy, tune, and operate it safely in production. The book walks you from single-node GPU servers to full-stack inference platforms with APIs, authentication, agents, retrieval-augmented generation (RAG), monitoring, and failure recovery.
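As a flavor of the serving surface described above, here is a minimal sketch of talking to vLLM's OpenAI-compatible HTTP endpoint using only the Python standard library. The model name, port, and prompt are placeholders, not examples from the book; it assumes a vLLM server is already listening locally.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    # The payload follows the OpenAI chat-completions schema,
    # which vLLM's HTTP server accepts; any OpenAI client works the same way.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query_vllm(base_url: str, payload: dict) -> dict:
    # POST to the standard /v1/chat/completions route exposed by `vllm serve`.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    payload = build_chat_request("my-org/my-model", "Why do naive LLM deployments fail?")
    # query_vllm("http://localhost:8000", payload)  # requires a running vLLM server
    print(json.dumps(payload, indent=2))
```

Because the API mirrors OpenAI's, existing client libraries and tooling can be pointed at a self-hosted vLLM deployment with only a base-URL change.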

Every chapter is practical. Every concept is validated through labs. Every design choice is grounded in real operational tradeoffs.

What You'll Learn

- Why LLM inference breaks at scale and how to avoid common failure modes
- How GPU memory, KV cache behavior, and context length define real capacity
- How vLLM works internally, and what actually makes it fast
- How to deploy vLLM on bare metal, virtualized GPU hosts, and private clouds
- How to serve models through secure, OpenAI-compatible APIs
- How to run agent and RAG workloads without runaway cost or instability
- How to load-test inference systems and identify safe operating limits
- How to monitor the GPU utilization, latency, and throughput metrics that truly matter
- How to troubleshoot OOMs, throughput collapse, and latency spikes
- How to plan capacity, estimate inference cost, and scale responsibly

Hands-On, End to End
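As a taste of the capacity arithmetic behind "GPU memory, KV cache behavior, and context length define real capacity", here is a back-of-the-envelope KV-cache estimate. The model dimensions are illustrative (roughly a 7B-class transformer), not figures taken from the book.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # Each token stores one key vector and one value vector (the factor of 2)
    # of size num_kv_heads * head_dim in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len

# Illustrative 7B-class config: 32 layers, 32 KV heads, head dim 128, fp16.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096) / 2**30
print(f"{gib:.1f} GiB for a single 4K-context sequence")  # → 2.0 GiB
```

At half a megabyte of cache per token, a handful of long-context requests can consume more GPU memory than the model weights themselves, which is why vLLM's paged KV-cache management matters for real capacity.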

This book is built around lab-first, failure-driven learning:

- Chapter-based practice labs reinforce every major concept
- A full-stack capstone project guides you through designing, deploying, and operating a production-grade inference platform using vLLM, GPUs, APIs, agents, RAG, and monitoring
- Operator-grade appendices provide cheat sheets, runbooks, security checklists, and 2026-ready roadmaps you can reuse in real systems

Who This Book Is For

- Backend and platform engineers running LLMs in production
- Infrastructure and DevOps teams managing GPU-backed services
- AI engineers building agent-based and RAG-powered applications
- Technical founders and builders operating private or on-prem inference platforms

If you are responsible for uptime, latency, cost, or reliability, this book is for you.

Who This Book Is Not For

This is not an introductory AI book.
It does not cover prompt engineering, model training, or high-level AI theory.
It assumes you want to operate LLM inference systems, not experiment with them.

Why This Book Stands Out

Most resources stop at "it works."
This book starts where demos fail.

vLLM in Production teaches you how to:

- Enforce limits instead of hoping for stability
- Reject traffic safely instead of crashing GPUs
- Measure reality instead of guessing performance
- Recover from failure instead of restarting blindly
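As one hypothetical illustration of "reject traffic safely" (a sketch under our own assumptions, not code from the book), an admission controller can fail fast once a concurrency cap is reached, rather than letting requests queue unboundedly in front of the GPU:

```python
import asyncio

class AdmissionController:
    """Cap in-flight requests; reject the excess immediately with a retriable error."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def handle(self, work):
        # locked() is True when the semaphore counter is zero, i.e. at capacity.
        if self._sem.locked():
            raise RuntimeError("at capacity; tell the client to retry (HTTP 503)")
        async with self._sem:
            return await work()
```

A real service would map the rejection to an HTTP 429/503 with a Retry-After header; the point is bounding work before it reaches the GPU instead of discovering the limit through an OOM.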

LLM inference is now core infrastructure.
This book shows you how to run it like one.
