Small Language Models in Production: Optimizing inference, reducing costs, and delivering enterprise-ready AI with quantization and distillation metho

By TALIA GRAHAM

No Customer Reviews

Ship enterprise-ready AI that is fast, affordable, and controllable with small language models engineered through quantization and distillation.

Many teams want the benefits of language models, but cost, latency, and compliance block real progress. This book focuses on making production systems work on real infrastructure, with methods that lower memory use, raise tokens per second, and keep behavior auditable. You will see where small models beat larger ones, how to size fleets for peak demand, and how to align performance targets with budgets. The material is grounded in healthcare, finance, retail, and manufacturing examples, so the guidance maps cleanly to day-to-day decisions.

You will learn practical approaches that move beyond proofs of concept. The book explains how to compress and serve models without losing essential quality, how to benchmark instruction following and safety, and how to meet obligations under current governance standards. Each topic connects to production tasks such as rollout planning, model monitoring, and incident response. The goal is clear: help you deploy reliable systems that meet service levels and cost controls.

Apply weight-only quantization with INT8 or INT4 using GPTQ and AWQ
Use activation quantization, including SmoothQuant and FP8
Reduce long-context costs with KV cache quantization and eviction
Serve at scale with vLLM PagedAttention and continuous batching
Tune TensorRT-LLM schedulers for throughput and tail latency
Deploy Hugging Face TGI on Gaudi and Inferentia2
Use speculative decoding and in-flight batching in production
Plan hardware across H100, H200, and B200, and evaluate Gaudi 3
Model tokens per second, TTFT, and end-to-end throughput
Run edge and on-device with llama.cpp, GGUF, MLC, WebGPU, and Apple MLX
Convert pipelines to GGUF, ONNX, DirectML, OpenVINO IR, and NNCF
Evaluate with MT-Bench and IFEval, plus safety, multilingual, math, and code
Map risks with the OWASP LLM Top 10 and set enterprise controls
Operate under EU AI Act timelines and the NIST AI RMF profile
Build logging, monitoring, canaries, autoscaling, and rollback plans

Code-heavy guide: includes working examples, configs, and commands that you can adapt to real services, from serving stacks to evaluation pipelines.

Get the playbook for small language models in production, and start building systems that are fast, cost-aware, and ready for enterprise use. Grab your copy today.

Format: Paperback

Language: English

ISBN: B0FTWKZDYR

ISBN13: 9798268181524

Release Date: October 2025

Publisher: Independently Published

Length: 278 Pages

Weight: 1.07 lbs.

Dimensions: 0.6" x 7.0" x 10.0"

Related Subjects

Computers Computers & Technology

