Paperback AI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio From LLM Fine-Tuning to Voice Agents, AR Interfaces, and Rea Book

Share to Pinterest

Share to Twitter

ISBN: B0FKVHKZ4Z

ISBN13: 9798296089038

AI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio From LLM Fine-Tuning to Voice Agents, AR Interfaces, and Rea

By Husn Ara

No Customer Reviews

AI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio
From LLM Fine-Tuning to Voice Agents, AR Interfaces, and Real-World Deployment

Unlock the future of artificial intelligence with practical, production-ready multi-modal engineering.

This hands-on guide is built for developers, researchers, and AI professionals who want to go beyond chatbots and dive into building intelligent systems that understand text, images, audio, and human intent - all in one pipeline.

Whether you're fine-tuning large language models (LLMs) or creating voice-driven AR interfaces, this book walks you through the real engineering decisions, tools, and architectures needed to bring multi-modal AI to life.

What You'll Learn:

Fine-tuning Large Language Models (LLMs): Train and adapt models like GPT-2, LLaMA, and Mistral for custom tasks using Hugging Face, LoRA, QLoRA, and PEFT.

Voice Interfaces: Combine Whisper, LLMs, and Bark/Tortoise TTS to build interactive speech-driven assistants.

Computer Vision + Language: Use models like BLIP, CLIP, and DETR to connect what systems see to what they say and understand.

Instruction Tuning & Hyperparameter Optimization: Build smarter, domain-specific models with efficient training workflows.

Multi-Modal Pipelines: Chain audio, image, and text inputs for question answering, summarization, tutoring, and AR/robotic control.

Real-Time Interfaces: Deploy intelligent agents using FastAPI, Streamlit, Gradio, Docker, and Hugging Face Spaces.

Edge & Offline Deployment: Optimize models with ONNX, quantization (4-bit, 8-bit), and TensorRT for low-latency inference on CPU/GPU.

Use Cases Covered:

Smart document summarizers with OCR + TTS

Voice-enabled image assistants

Emotion-aware agents

Virtual tutors

AR-enhanced AI interfaces

Robotic perception + control from voice/image input

Secure, multilingual, and privacy-conscious AI systems

Tools & Frameworks Inside:

Python, PyTorch, Hugging Face Transformers

LangChain, OpenCV, Whisper, TTS, BLIP

ROS, Unity (AR/VR), Gradio, Streamlit

Docker, FastAPI, gRPC, TorchServe

Built for engineers. Written with depth. Designed for real-world impact.

If you're ready to build intelligent multi-modal agents that understand the world like humans do - across speech, vision, and language - this book gives you the complete roadmap.

Perfect for:
Machine learning engineers, data scientists, AI product developers, researchers, robotics engineers, and anyone building cutting-edge AI systems.

Format:Paperback

Language:English

ISBN:B0FKVHKZ4Z

ISBN13:9798296089038

Release Date:August 2025

Publisher:Independently Published

Length:296 Pages

Weight:0.88 lbs.

Dimensions:0.6" x 6.0" x 9.0"

Related Subjects

Computers Computers & Technology

Customer Reviews

0 rating

Write a review

ThriftBooks sells millions of used books at the lowest everyday prices. We personally assess every book's quality and offer rare, out-of-print treasures. We deliver the joy of reading in recyclable packaging with free standard shipping on US orders over $20. ThriftBooks.com. Read more. Spend less.

Copyright © 2026 Thriftbooks.com Terms of Use | Privacy Policy | Do Not Sell/Share My Personal Information | Cookie Policy | Cookie Preferences | Accessibility Statement
ThriftBooks ^® and the ThriftBooks ^® logo are registered trademarks of Thrift Books Global, LLC

AI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio From LLM Fine-Tuning to Voice Agents, AR Interfaces, and Rea

Recommended

Customer Reviews

Popular Categories

Website

My Account

Partnerships

Quick Help

About Us

Follow Us