Mastering InfiniBand is a definitive, practitioner-focused guide to designing, building, and operating the fabrics that power modern HPC clusters, AI training platforms, and data-centric infrastructure. It distills the InfiniBand architecture from first principles (end-to-end channel semantics, addressing with GUIDs, LIDs, and GIDs, packet formats, virtual lanes, and credit-based flow control) through management planes (SMA, SM, SA, PMA, BMA) and IP transport via IPoIB. The book then grounds readers in physical and link-layer engineering, covering signaling from SDR to HDR/NDR and emerging XDR, lane bonding and breakouts, FEC/CRC and error propagation, port state machines, arbitration and deadlock avoidance, optics and cabling for reach and BER, and structured wiring with proactive telemetry to keep large-scale fabrics healthy. For software and system engineers, the text provides a deep dive into transport semantics and the RDMA programming model: RC, UC, UD, XRC, and DC; queue pairs and scalable completion paths; work requests, S/G lists, and polling strategies; memory registration, MR caching, and ODP; atomics, fencing, and ordering. Advanced coverage of mlx5 direct verbs and DevX enables direct hardware programming, while guidance on doorbells, BlueFlame, inline thresholds, batching, tag-matching offload, and multi-rail striping shows how to extract real-world performance. Integration chapters bridge the fabric to MPI (UCX, libfabric/OFI, HPC-X), in-network compute with SHARP, GPU networking with GPUDirect RDMA/Async and NCCL topology-aware collectives, storage over RDMA (SRP, iSER, NVMe/RDMA, SMB Direct) and parallel file systems, plus virtualization (SR-IOV, VFIO, nested) and Kubernetes device plugins, CNI, and pod-level QoS, ensuring clean workflows across HPC, AI, and service-oriented stacks.
Architects and operators will find rigorous treatment of fabric topologies (fat-tree, dragonfly(+), torus, hypercube), routing strategies and adaptive policies, QoS design, congestion control and tuning, multicast scaling, and capacity planning. A comprehensive performance engineering toolkit spans host architecture (PCIe/NVLink, NUMA), IOMMU/ATS, huge pages, message sizing, connection scaling, interrupt moderation, and jitter and tail-latency control, along with fair microbenchmarking and end-to-end roofline-style modeling. Day-2 operations are covered end to end: PMA-driven telemetry pipelines, SLO dashboards, BER/FEC health signals, failure domains and fast reroute, troubleshooting loops and misroutes, incast containment, packet capture and tracing, and incident response playbooks. The roadmap closes with HDR/NDR deployment trade-offs, InfiniBand routers and multi-subnet scale-out, Ethernet interoperability and RoCE contrasts, DPUs and control-plane offload, time sync, energy efficiency, zero-trust security, migration strategies, and the future of in-network compute and XDR, equipping readers to build resilient, efficient fabrics that scale with confidence.