Overview
The Kafka Cluster High Availability & Disaster Recovery course teaches how to design, configure, and manage resilient, fault-tolerant Apache Kafka environments that deliver high availability (HA) and disaster recovery (DR) in critical infrastructure.
During the training, participants learn replication, failover, backup, and restore strategies, along with best practices for mitigating outages and protecting data in transit and at rest.
With a focus on enterprise architectures and distributed environments, the course takes a practical, detailed approach to keeping Kafka highly available and reliable.
Course Content
Module 1: Introduction to High Availability in Kafka
- Understanding HA and DR concepts in distributed systems
- Kafka architecture and fault tolerance mechanisms
- The role of replication, partitions, and leader election
- Common failure scenarios and how Kafka handles them
Module 2: Broker and Cluster Replication
- In-sync replicas (ISR) and replication factor tuning
- Leader-follower synchronization
- Handling unclean leader election and replica lag
- Hands-on: configuring and testing replication reliability
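The relationship between replication factor and in-sync replicas covered in this module can be sketched in a few lines. The example below is a minimal model, not broker code: the class and method names are hypothetical, and it assumes producers use `acks=all`, in which case a partition keeps accepting writes only while its ISR has at least `min.insync.replicas` members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicDurability:
    """Hypothetical helper modelling replication.factor and min.insync.replicas."""

    replication_factor: int
    min_insync_replicas: int

    def __post_init__(self) -> None:
        if not 1 <= self.min_insync_replicas <= self.replication_factor:
            raise ValueError("min.insync.replicas must be between 1 and the replication factor")

    def tolerated_broker_failures(self) -> int:
        # Replicas that can drop out of the ISR before acks=all writes are rejected.
        return self.replication_factor - self.min_insync_replicas

    def accepts_acks_all_writes(self, isr_size: int) -> bool:
        # Assumption: the producer uses acks=all, so the ISR size is what matters.
        return isr_size >= self.min_insync_replicas


if __name__ == "__main__":
    topic = TopicDurability(replication_factor=3, min_insync_replicas=2)
    print("tolerated failures:", topic.tolerated_broker_failures())
    print("writable with ISR of 1:", topic.accepts_acks_all_writes(isr_size=1))
```

A common starting point is a replication factor of 3 with `min.insync.replicas` set to 2, which tolerates one broker failure without rejecting writes; the right values depend on the durability and availability targets of each topic.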
Module 3: Designing for High Availability
- Multi-broker cluster design principles
- Data replication strategies across racks and nodes
- Avoiding single points of failure (SPOF)
- Using KRaft mode for cluster coordination redundancy
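To make the idea of spreading replicas across racks concrete, here is a simplified sketch. It is not Kafka's actual assignment algorithm: the function names are hypothetical, the broker-to-rack map stands in for each broker's `broker.rack` setting, and the rack spread is only guaranteed when every rack holds the same number of brokers.

```python
from __future__ import annotations

from collections import defaultdict


def interleave_by_rack(broker_racks: dict[int, str]) -> list[int]:
    """Order brokers so that consecutive entries come from different racks."""
    by_rack: dict[str, list[int]] = defaultdict(list)
    for broker_id in sorted(broker_racks):
        by_rack[broker_racks[broker_id]].append(broker_id)

    ordered: list[int] = []
    depth = 0
    while len(ordered) < len(broker_racks):
        # Take the depth-th broker from each rack in turn.
        for rack in sorted(by_rack):
            if depth < len(by_rack[rack]):
                ordered.append(by_rack[rack][depth])
        depth += 1
    return ordered


def assign_replicas(
    broker_racks: dict[int, str], num_partitions: int, replication_factor: int
) -> dict[int, list[int]]:
    """Round-robin over the rack-interleaved broker list (simplified sketch)."""
    if replication_factor > len(broker_racks):
        raise ValueError("replication factor exceeds the number of brokers")
    ordered = interleave_by_rack(broker_racks)
    n = len(ordered)
    return {
        p: [ordered[(p + r) % n] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


if __name__ == "__main__":
    racks = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}
    for partition, replicas in assign_replicas(racks, 6, 3).items():
        print(partition, replicas, [racks[b] for b in replicas])
```

With three racks and a replication factor of 3, each partition in this example ends up with one replica per rack, so losing an entire rack still leaves two replicas available.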
Module 4: Kafka Multi-Cluster and Cross-Region Replication
- Understanding multi-datacenter deployments
- Implementing asynchronous replication between clusters
- Using MirrorMaker 2.0 for cross-region replication
- Design patterns for active-active and active-passive clusters
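The difference between active-passive and active-active replication can be illustrated by the MirrorMaker 2.0 configuration each one needs. The sketch below renders a properties file from a list of replication flows. The cluster aliases and bootstrap addresses are hypothetical, and the property names (`clusters`, `<alias>.bootstrap.servers`, `<source>-><target>.enabled`, `<source>-><target>.topics`, `replication.factor`) should be checked against the documentation for the Kafka version in use.

```python
from __future__ import annotations


def render_mm2_properties(
    clusters: dict[str, str],
    flows: list[tuple[str, str]],
    topics: str = ".*",
    replication_factor: int = 3,
) -> str:
    """Render a minimal MirrorMaker 2.0 properties file as a string."""
    lines = ["clusters = " + ", ".join(clusters)]
    for alias, bootstrap in clusters.items():
        lines.append(f"{alias}.bootstrap.servers = {bootstrap}")
    for source, target in flows:
        if source not in clusters or target not in clusters:
            raise ValueError(f"unknown cluster alias in flow {source}->{target}")
        # One enabled flow per direction of replication.
        lines.append(f"{source}->{target}.enabled = true")
        lines.append(f"{source}->{target}.topics = {topics}")
    lines.append(f"replication.factor = {replication_factor}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical aliases and addresses, for illustration only.
    clusters = {"primary": "primary-kafka:9092", "dr": "dr-kafka:9092"}
    print("# active-passive")
    print(render_mm2_properties(clusters, [("primary", "dr")]))
    print("# active-active")
    print(render_mm2_properties(clusters, [("primary", "dr"), ("dr", "primary")]))
```

Active-passive uses a single flow from the primary to the DR cluster, while active-active enables a flow in each direction. With the default replication policy, mirrored topics are prefixed with the source cluster alias, which keeps the two directions from feeding each other in a loop.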
Module 5: Backup and Disaster Recovery Strategies
- Backup and restore procedures for Kafka topics and metadata
- Snapshots, incremental backups, and retention policies
- Planning and executing disaster recovery drills
- Automating backup pipelines using scripts and tools
Module 6: Failover, Monitoring, and Recovery Automation
- Configuring broker auto-restart and recovery
- Using Cruise Control for rebalancing and failover
- Monitoring replication and leader election health
- Integrating alerting with Prometheus and Grafana
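As a rough illustration of what replication and leader health monitoring looks for, the sketch below classifies partitions from a metadata snapshot. The data structure and function names are hypothetical. In a real deployment the same signals are typically read from broker metrics such as `UnderReplicatedPartitions` and `OfflinePartitionsCount` and turned into alerts.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition's metadata."""

    topic: str
    partition: int
    leader: Optional[int]  # None when no leader is currently elected
    replicas: tuple[int, ...]
    isr: tuple[int, ...]


def classify(state: PartitionState) -> str:
    if state.leader is None:
        return "offline"  # no leader, so the partition cannot serve reads or writes
    if len(state.isr) < len(state.replicas):
        return "under_replicated"  # at least one replica has fallen out of the ISR
    return "healthy"


def summarize(states: list[PartitionState]) -> Counter:
    """Count partitions per health class, e.g. to feed an alert threshold."""
    return Counter(classify(s) for s in states)


if __name__ == "__main__":
    snapshot = [
        PartitionState("orders", 0, 1, (1, 2, 3), (1, 2, 3)),
        PartitionState("orders", 1, 2, (2, 3, 1), (2,)),
        PartitionState("orders", 2, None, (3, 1, 2), ()),
    ]
    print(dict(summarize(snapshot)))
```

An alert rule would usually page on any offline partition and warn when under-replicated partitions persist for longer than a normal broker restart.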
Module 7: Testing, Validation, and Performance Under Failover
- Simulating broker and network failures
- Measuring recovery time against the recovery time objective (RTO) and data loss against the recovery point objective (RPO)
- Stress testing and performance benchmarking during failover
- Case study: end-to-end failover and DR validation
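The measurements from a failover drill can be compared against RTO and RPO targets with simple timestamp arithmetic. The sketch below is one way to do it under stated assumptions: the class and field names are hypothetical, and the data loss window is approximated as the time between the last record known to be replicated and the moment of failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FailoverDrill:
    """Hypothetical record of the timestamps captured during one drill."""

    failure_at: datetime
    service_restored_at: datetime
    last_replicated_record_at: datetime

    @property
    def recovery_time(self) -> timedelta:
        # Measured recovery time: how long clients were without service.
        return self.service_restored_at - self.failure_at

    @property
    def data_loss_window(self) -> timedelta:
        # Records produced in this window may not have reached the DR cluster.
        return max(self.failure_at - self.last_replicated_record_at, timedelta(0))

    def meets(self, rto: timedelta, rpo: timedelta) -> bool:
        return self.recovery_time <= rto and self.data_loss_window <= rpo


if __name__ == "__main__":
    drill = FailoverDrill(
        failure_at=datetime(2024, 1, 1, 12, 0, 0),
        service_restored_at=datetime(2024, 1, 1, 12, 4, 30),
        last_replicated_record_at=datetime(2024, 1, 1, 11, 59, 40),
    )
    print("recovery time:", drill.recovery_time)
    print("data loss window:", drill.data_loss_window)
    print("within targets:", drill.meets(rto=timedelta(minutes=5), rpo=timedelta(seconds=30)))
```

Recording these three timestamps for every drill makes it possible to track whether recovery time and data loss stay within the agreed objectives as the cluster and its workload change.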
Module 8: Hands-On Project
Project: Configure a multi-broker Kafka cluster with MirrorMaker 2.0, implement cross-region replication, simulate broker failures, and execute a full recovery test scenario.