Paperback Multimodal Systems in Practice: A Technical Guide to Agents, Tools, and Architectures That Combine Vision, Speech, and Language Book

ASIN: B0FHF7SHTT

ISBN13: 9798292171553

Multimodal Systems in Practice: A Technical Guide to Agents, Tools, and Architectures That Combine Vision, Speech, and Language

Unlock the future of artificial intelligence with the first deeply practical guide to building and understanding multimodal AI systems - agents and architectures that can see, hear, speak, and reason.

Whether you're a machine learning engineer, AI researcher, or technical product leader, Multimodal Systems in Practice equips you with the tools, frameworks, and know-how to build powerful multimodal agents using vision, speech, and language models - all in real-world settings.

This definitive guide covers cutting-edge systems like GPT-4o, Gemini 1.5, Claude 3.5, ImageBind, Whisper, Sora, Runway, and more - showing how they work under the hood and how to integrate them into practical pipelines.


What You'll Learn:

How Multimodal AI Models Work: Understand the architectures behind vision-language models (VLMs), audio-text agents, and real-time multimodal perception.

How to Build with LangChain, Hugging Face & OpenAI Tools: Construct multimodal pipelines using popular frameworks.

How to Combine Images, Audio & Text in One System: Step-by-step examples for building agents that speak, see, listen, and act in real time.

How to Evaluate & Deploy Multimodal Systems: Master benchmarking, memory management, and safety protocols for production-ready systems.

How to Navigate Ethics and Risks: Address hallucinations, deepfake risks, prompt injection attacks, and visual bias.


Key Topics Include:

Multimodal representation learning (CLIP, FLAVA, Flamingo)

Real-time speech + text agents with Whisper and SeamlessM4T

Tool-augmented agents using LangChain and CrewAI

Multimodal retrieval and long-horizon context with memory buffers

Video generation models like Sora and Make-A-Video

Autonomous agent design, multimodal reinforcement learning, hybrid AI systems


Who This Book Is For:

AI engineers & ML developers building production-grade AI systems

Technical product teams deploying intelligent assistants and agents

Researchers exploring cross-modal representation and fusion

Advanced practitioners working with vision-language or speech-language models

Builders experimenting with multimodal LLMs like GPT-4o, Gemini, Claude

Format: Paperback
