Deepspeed in Production: INFERENCE OPTIMIZATION AND MODEL: Deploy LLMs efficiently with optimized serving, quantization, and low-latency inference for

ISBN: B0G2FY1FDW

ISBN13: 9798274507356

Run large language models with predictable latency, controlled cost, and production reliability.

Shipping LLMs is an operational problem. Teams struggle with time to first token, tokens per second, GPU memory pressure, and a moving target of engines and datatypes. This book turns those issues into clear practices you can apply with DeepSpeed and the serving layers you already use.

You get a practical path from checkpoint to stable API, with configuration that fits real workloads, not toy demos. Every topic is grounded in measurable outcomes so your stack meets SLOs under mixed traffic and budget constraints.

- Place DeepSpeed correctly in your stack and configure kernel injection, tensor parallel, and ZeRO for real services
- Understand TTFT and throughput from prefill to decode and set metrics for p95 latency and queue time
- Size and control the KV cache with paged attention, batching, and safe headroom targets
- Apply quantization that holds up under load, including W8A8, AWQ, GPTQ, FP8, and FP4
- Use speculative decoding with a sound drafter choice, acceptance math, and stable fallbacks
- Operate vLLM, TensorRT-LLM on Triton, and TGI with clean API surfaces and core flags
- Scale with Ray Serve and plan capacity from workload shapes and arrival patterns
- Tune for NVIDIA Hopper and Blackwell or AMD MI300X, with attention backends and NVLink planning
- Run on Kubernetes with GPU Operator, device plugin, MIG, and topology-aware placement
- Wire observability with Prometheus, DCGM, and OpenTelemetry spans, plus vllm bench, trtllm-bench, and GenAI-Perf
- Ship safely with quotas, redaction, audit logs, go-live gates, and instant rollback plans

This is a code-heavy guide with working YAML, JSON, shell, and Python examples that map directly to production, from gateway limits and network policies to rollout templates and exportable benchmark scripts.

Grab your copy today and build an LLM service that stays fast, measurable, and dependable.

Format: Paperback

Condition: New

$33.72
Save $1.23!
List Price $34.95
Ships within 2-3 days

Customer Reviews

0 ratings
Copyright © 2026 Thriftbooks.com
ThriftBooks ® and the ThriftBooks ® logo are registered trademarks of Thrift Books Global, LLC