Overview
This course, Data Platform Reliability Engineering, covers the principles, practices, and techniques of reliability engineering as applied specifically to modern data platforms. The focus is on ensuring the availability, reliability, resilience, predictability, and recoverability of analytics platforms and data pipelines in production. Students will learn to apply SRE concepts to the data context, balancing delivery speed against operational stability and business impact.
Course Content
Module 1: Reliability Engineering for Data Platforms
- What is reliability engineering
- Reliability challenges in data platforms
- Data availability vs data correctness
- Reliability and business impact
Module 2: SRE Principles Applied to Data
- Error budgets for data systems
- Balancing velocity and stability
- Toil in data operations
- Reliability ownership models
Module 3: SLIs, SLOs and SLAs for Data
- Defining data SLIs
- Freshness, completeness and accuracy SLOs
- Consumer-driven SLAs
- Measuring reliability over time
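As an illustration of the Module 3 topics, a freshness SLI can be sketched as the fraction of periodic checks in which a dataset was updated recently enough, and compared against an SLO target to see how much error budget remains. This is a minimal sketch, not course material: the names `FreshnessCheck`, `freshness_sli`, and `error_budget_remaining` are hypothetical, and the staleness threshold and SLO target are assumptions a team would choose for itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FreshnessCheck:
    """One observation: when we looked, and when the dataset was last updated."""
    checked_at: datetime
    last_updated_at: datetime


def freshness_sli(checks, max_staleness):
    """Fraction of checks in which the data was no older than max_staleness."""
    if not checks:
        raise ValueError("at least one check is required")
    good = sum(
        1 for c in checks if c.checked_at - c.last_updated_at <= max_staleness
    )
    return good / len(checks)


def error_budget_remaining(sli, slo_target):
    """Share of the error budget still unspent (negative once the SLO is breached)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return 1.0 - (1.0 - sli) / budget
```

For example, with an assumed SLO target of 0.8, an SLI of 0.9 means half of the allowed "bad" checks have been used, so about half of the error budget is left for the rest of the measurement window.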
Module 4: Designing Reliable Data Pipelines
- Idempotent pipeline design
- Failure isolation strategies
- Dependency management
- Backfill and reprocessing
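As an illustration of the Module 4 topics, one common way to make a pipeline step idempotent is to replace a whole partition inside a single transaction, so that retries and backfills produce the same final state no matter how many times they run. The sketch below uses Python's standard `sqlite3` module as a stand-in for a real warehouse; the `daily_sales` table, its columns, and the `load_partition` and `backfill` helpers are hypothetical names chosen for the example.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Replace all rows for one day in a single transaction.

    Running this twice with the same input leaves the table unchanged,
    which makes retries and backfills safe.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, store, amount) VALUES (?, ?, ?)",
            [(day, store, amount) for store, amount in rows],
        )


def backfill(conn, batches):
    """Reprocess a set of days, one partition at a time, in date order."""
    for day, rows in sorted(batches.items()):
        load_partition(conn, day, rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, store TEXT, amount REAL)")
batches = {
    "2024-01-01": [("A", 10.0), ("B", 5.0)],
    "2024-01-02": [("A", 7.5)],
}
backfill(conn, batches)
backfill(conn, batches)  # rerunning the backfill does not duplicate rows
```

The design choice here is delete-then-insert per partition rather than append: an append-only load would double the rows on every rerun, while a partition overwrite keeps the result a function of the input alone.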
Module 5: Failure Modes and Risk Management
- Common data platform failure modes
- Dependency risk analysis
- Capacity and scaling risks
- Change management
Module 6: Incident Management for Data Platforms
- Detecting data incidents
- Incident response workflows
- Communication and escalation
- Incident resolution strategies
Module 7: Postmortems and Continuous Improvement
- Blameless postmortems
- Root cause analysis
- Action items and follow-ups
- Reliability learning loops
Module 8: Automation and Reliability
- Reducing manual operations
- Self-healing data pipelines
- Automated recovery strategies
- Reliability-driven automation
Module 9: Operating Data Platforms at Scale
- Multi-environment reliability
- Cost vs reliability trade-offs
- Platform evolution and migrations
- Long-term reliability planning
Module 10: Best Practices and Real-World Scenarios
- End-to-end reliable data platform
- Anti-patterns in data reliability
- Reliability maturity models
- Preparing for platform excellence