Multimodal Systems in Practice: A Technical Guide to Agents, Tools, and Architectures That Combine Vision, Speech, and Language

ISBN: B0FHF7SHTT

ISBN13: 9798292171553

By Ben Laurenson

A practical guide to building and understanding multimodal AI systems: agents and architectures that can see, hear, speak, and reason.

Whether you're a machine learning engineer, AI researcher, or technical product leader, Multimodal Systems in Practice equips you with the tools, frameworks, and know-how to build multimodal agents using vision, speech, and language models in real-world settings.

The guide covers systems such as GPT-4o, Gemini 1.5, Claude 3.5, ImageBind, Whisper, Sora, and Runway, showing how they work under the hood and how to integrate them into practical pipelines.

What You'll Learn:

How Multimodal AI Models Work: Understand the architectures behind vision-language models (VLMs), audio-text agents, and real-time multimodal perception.

How to Build with LangChain, Hugging Face & OpenAI Tools: Construct multimodal pipelines using popular frameworks.

How to Combine Images, Audio & Text in One System: Step-by-step examples for building agents that speak, see, listen, and act in real time.

How to Evaluate & Deploy Multimodal Systems: Master benchmarking, memory management, and safety protocols for production-ready systems.

How to Navigate Ethics and Risks: Address hallucinations, deepfake risks, prompt injection attacks, and visual bias.

Key Topics Include:

Multimodal representation learning (CLIP, FLAVA, Flamingo)

Real-time speech + text agents with Whisper and SeamlessM4T

Tool-augmented agents using LangChain and CrewAI

Multimodal retrieval and long-horizon context with memory buffers

Video models like Sora and Make-A-Video

Autonomous agent design, multimodal reinforcement learning, and hybrid AI systems

Who This Book Is For:

AI engineers & ML developers building production-grade AI systems

Technical product teams deploying intelligent assistants and agents

Researchers exploring cross-modal representation and fusion

Advanced practitioners working with vision-language or speech-language models

Builders experimenting with multimodal LLMs such as GPT-4o, Gemini, and Claude

Format: Paperback

Language: English

Release Date: July 2025

Publisher: Independently Published

Length: 174 Pages

Weight: 0.69 lbs.

Dimensions: 0.4" x 7.0" x 10.0"

Related Subjects

Computers, Computers & Technology
