This book is a technical manual for building ultra-low-latency inference platforms for LLMs and multimodal models. It explains model sharding, tensor/model parallelism, pipeline parallelism, clever batching strategies, quantization techniques, kernel-level optimizations, and GPU/TPU orchestration. It also includes on-device and edge patterns, caching strategies, network optimizations, and benchmarking methodologies for selecting cost-effective hardware-software stacks.

Who this book is for
Infrastructure and performance engineers optimizing inference pipelines.
Platform architects designing low-latency, high-throughput model serving.
CTOs evaluating hardware and deployment trade-offs for AI services.
Engineers deploying on-edge or hybrid cloud-edge inference topologies.

What the reader will learn
Model sharding and parallelism techniques for throughput and latency.
Batching and pipelining heuristics for real-world traffic patterns.
Quantization, pruning, and distillation tactics to reduce compute.
Autoscaling, scheduler design, and GPU orchestration best practices.
Edge inference patterns and hybrid on-prem/cloud deployment strategies.
How to benchmark, profile, and tune end-to-end latency and costs.