Overview
The Datadog SRE and SLO Engineering course trains professionals to apply Site Reliability Engineering (SRE) practices using the Datadog platform. It covers defining and managing SLIs, SLOs, and SLAs, reliability monitoring, incident management, and the continuous improvement of system performance and availability.
Course Content
Module 1 – Introduction to SRE and Reliability Engineering
- SRE principles and philosophy
- Difference between SRE and traditional operations
- Reliability, availability, and scalability concepts
- Service lifecycle and reliability goals
- Overview of Datadog for SRE
Module 2 – Service Level Indicators (SLIs)
- Definition and importance of SLIs
- Types of SLIs (latency, availability, throughput, errors)
- Selecting meaningful SLIs
- Data sources (metrics, logs, traces)
- SLI implementation in Datadog
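
As an illustration of the SLI types listed above, an availability or error SLI is commonly expressed as the ratio of good events to total events. The sketch below is a minimal, platform-independent example; the function name and the choice to treat a window with no traffic as fully available are assumptions made for illustration, not part of the course material or of any Datadog API.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good (0.0 to 1.0).

    Assumption: a window with no traffic counts as fully available.
    """
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


if __name__ == "__main__":
    # Hypothetical counts: 999 successful requests out of 1000.
    print(availability_sli(999, 1000))
```

The same ratio shape applies to latency SLIs if a "good" event is defined as a request served under a chosen latency threshold.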
Module 3 – Service Level Objectives (SLOs)
- Defining SLOs aligned with business goals
- SLO types and structures
- Target setting and thresholds
- SLO dashboards in Datadog
- Monitoring SLO compliance
Module 4 – Service Level Agreements (SLAs)
- SLA fundamentals
- Relationship between SLA and SLO
- Business and legal considerations
- Reporting SLA performance
- Managing customer expectations
Module 5 – Error Budgets and Reliability Management
- Error budget concept
- Calculating and tracking error budgets
- Balancing reliability and innovation
- Policies for error budget consumption
- Alerting based on error budgets
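
The error budget calculation named in this module can be sketched with simple arithmetic: the budget is the complement of the SLO target, applied either to a time window or to an event count. The function names and sample numbers below are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unreliability for a time-based SLO over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the event-based error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no error budget available for this target and volume")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 500 bad events against an allowance of about 1000 leaves about half the budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

A policy for error budget consumption would typically key off the second value, for example slowing releases once the remaining fraction drops below an agreed level.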
Module 6 – Monitoring and Alerting Strategies
- SLO-based alerting
- Burn rate alerts
- Reducing alert fatigue
- Incident detection strategies
- Alert tuning and optimization
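
To make the burn rate topic concrete: burn rate is usually defined as the observed error rate divided by the error rate the SLO allows, so a value of 1 spends the budget exactly over the SLO window. The sketch below assumes a multi-window check in which both a long and a short window must exceed the threshold before paging, which is one common way to reduce alert fatigue. The default threshold of 14.4 is an assumption taken from a widely used convention (2% of a 30-day budget spent in one hour), and the function names are hypothetical rather than Datadog settings.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_rate / budget


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The short window confirms the problem is still ongoing, which
    avoids paging for an incident that has already recovered.
    """
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical readings for a 99.9% SLO: 2% errors in both windows.
    print(should_page(0.02, 0.02, 0.999))
    # Long window still elevated, short window recovered.
    print(should_page(0.02, 0.001, 0.999))
```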
Module 7 – Incident Management and Postmortems
- Incident lifecycle management
- Incident response workflows
- Communication during incidents
- Blameless postmortems
- Root cause analysis (RCA)
Module 8 – Continuous Improvement and Reliability Culture
- Reliability maturity models
- Automation and toil reduction
- Chaos engineering basics
- Continuous improvement practices
- Building a reliability-first culture