Deepspeed in Production: INFERENCE OPTIMIZATION AND MODEL: Deploy LLMs efficiently with optimized serving, quantization, and low-latency inference for

By TARA MALHOTRA

Run large language models with predictable latency, controlled cost, and production reliability.

Shipping LLMs is an operational problem. Teams struggle with time to first token, tokens per second, GPU memory pressure, and a moving target of engines and datatypes. This book turns those issues into clear practices you can apply with DeepSpeed and the serving layers you already use.

You get a practical path from checkpoint to stable API, with configuration that fits real workloads, not toy demos. Every topic is grounded in measurable outcomes so your stack meets SLOs under mixed traffic and budget constraints.

- Place DeepSpeed correctly in your stack and configure kernel injection, tensor parallel, and ZeRO for real services
- Understand TTFT and throughput from prefill to decode, and set metrics for p95 latency and queue time
- Size and control the KV cache with paged attention, batching, and safe headroom targets
- Apply quantization that holds up under load, including W8A8, AWQ, GPTQ, FP8, and FP4
- Use speculative decoding with a sound drafter choice, acceptance math, and stable fallbacks
- Operate vLLM, TensorRT-LLM on Triton, and TGI with clean API surfaces and core flags
- Scale with Ray Serve and plan capacity from workload shapes and arrival patterns
- Tune for NVIDIA Hopper and Blackwell or AMD MI300X, with attention backends and NVLink planning
- Run on Kubernetes with GPU Operator, device plugin, MIG, and topology-aware placement
- Wire observability with Prometheus, DCGM, and OpenTelemetry spans, plus vllm bench, trtllm-bench, and GenAI-Perf
- Ship safely with quotas, redaction, audit logs, go-live gates, and instant rollback plans

This is a code-heavy guide with working YAML, JSON, shell, and Python examples that map directly to production, from gateway limits and network policies to rollout templates and exportable benchmark scripts.

Grab your copy today and build an LLM service that stays fast, measurable, and dependable.

Format: Paperback

Language: English

ISBN: B0G2FY1FDW

ISBN13: 9798274507356

Release Date: November 2025

Publisher: Independently Published

Length: 290 Pages

Weight: 1.12 lbs.

Dimensions: 0.6" x 7.0" x 10.0"

Related Subjects

Computers Computers & Technology
