Warp 2.0: High-Performance Infrastructure for Low-Latency AI and Distributed Inference: Distributed serving, quantization, hardware orchestration, edg

ISBN: B0FQVLNM51

ISBN13: 9798264957444

This book is a technical manual for building ultra-low-latency inference platforms for LLMs and multimodal models. It explains model sharding, tensor and pipeline parallelism, batching strategies, quantization techniques, kernel-level optimizations, and GPU/TPU orchestration. It also covers on-device and edge patterns, caching strategies, network optimizations, and benchmarking methodologies for selecting cost-effective hardware-software stacks.

Who this book is for
Infrastructure and performance engineers optimizing inference pipelines.
Platform architects designing low-latency, high-throughput model serving.
CTOs evaluating hardware and deployment trade-offs for AI services.
Engineers deploying on-edge or hybrid cloud-edge inference topologies.

What the reader will learn
Model sharding and parallelism techniques for throughput and latency.
Batching and pipelining heuristics for real-world traffic patterns.
Quantization, pruning, and distillation tactics to reduce compute.
Autoscaling, scheduler design, and GPU orchestration best practices.
Edge inference patterns and hybrid on-prem/cloud deployment strategies.
How to benchmark, profile, and tune end-to-end latency and costs.

Format: Paperback
