Datadog SRE and SLO Engineering Course

24 hours
Overview

The Datadog SRE and SLO Engineering course trains professionals to apply Site Reliability Engineering (SRE) practices using the Datadog platform. It covers defining and managing SLIs, SLOs, and SLAs, reliability monitoring, incident management, and continuous improvement of system performance and availability.

Objectives

After completing this Datadog SRE and SLO Engineering course, you will be able to:

  • Understand the principles of SRE
  • Define SLIs, SLOs, and SLAs effectively
  • Implement monitoring based on service level objectives
  • Use Datadog to track reliability
  • Manage error budgets
  • Improve system availability and performance
  • Implement incident management practices
  • Promote a culture of reliability and continuous improvement
Target Audience
  • SRE engineers
  • DevOps engineers
  • Infrastructure and operations professionals
  • Systems architects
  • Software engineers who work on reliability
Prerequisites
  • Basic knowledge of monitoring and observability
  • Familiarity with Datadog (recommended)
  • Familiarity with distributed and cloud environments
Materials
English/Portuguese + Exercises + Hands-on Lab
Course Content

Module 1 – Introduction to SRE and Reliability Engineering

  1. SRE principles and philosophy
  2. Difference between SRE and traditional operations
  3. Reliability, availability, and scalability concepts
  4. Service lifecycle and reliability goals
  5. Overview of Datadog for SRE

Module 2 – Service Level Indicators (SLIs)

  1. Definition and importance of SLIs
  2. Types of SLIs (latency, availability, throughput, errors)
  3. Selecting meaningful SLIs
  4. Data sources (metrics, logs, traces)
  5. SLI implementation in Datadog
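
As a rough illustration of the topics in this module, an availability SLI is commonly expressed as the ratio of good events to total events. The sketch below is not taken from the course material and does not call any Datadog API; the function name `availability_sli` and the choice to treat a window with no traffic as fully available are assumptions made for the example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good, between 0.0 and 1.0."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("counts must satisfy 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return good_events / total_events


if __name__ == "__main__":
    # Hypothetical counts: 9,990 successful requests out of 10,000.
    sli = availability_sli(9_990, 10_000)
    print(f"availability SLI: {sli:.3%}")
```

The same ratio shape works for a latency SLI if "good" is redefined as "served faster than a chosen threshold".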

Module 3 – Service Level Objectives (SLOs)

  1. Defining SLOs aligned with business goals
  2. SLO types and structures
  3. Target setting and thresholds
  4. SLO dashboards in Datadog
  5. Monitoring SLO compliance

Module 4 – Service Level Agreements (SLAs)

  1. SLA fundamentals
  2. Relationship between SLA and SLO
  3. Business and legal considerations
  4. Reporting SLA performance
  5. Managing customer expectations

Module 5 – Error Budgets and Reliability Management

  1. Error budget concept
  2. Calculating and tracking error budgets
  3. Balancing reliability and innovation
  4. Policies for error budget consumption
  5. Alerting based on error budgets
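
To make the error budget arithmetic in this module concrete, here is a minimal sketch in plain Python. It assumes the usual definition in which the budget is the complement of the SLO target; the function names and the sample numbers are hypothetical and are not drawn from the course or from Datadog.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed minutes of unreliability for a time-based SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(f"budget: {error_budget_minutes(0.999, 30):.1f} minutes")
    # Hypothetical traffic: 500 failures out of 1,000,000 requests.
    print(f"remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

A negative result from `budget_remaining` means the service has already spent more than its budget for the window, which is the situation an error budget policy is meant to address.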

Module 6 – Monitoring and Alerting Strategies

  1. SLO-based alerting
  2. Burn rate alerts
  3. Reducing alert fatigue
  4. Incident detection strategies
  5. Alert tuning and optimization
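
Burn rate alerts can be sketched as follows. Burn rate is taken here as the observed error rate divided by the error rate the SLO allows, and the alert fires only when both a long and a short window exceed a threshold. The default threshold of 14.4 is one commonly cited starting point for a one-hour window on a 30-day SLO and is an assumption here, as are the function names. This is plain Python, not a Datadog monitor definition.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the errors are arriving."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_page(
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default, tune per service
) -> bool:
    """Page only if both windows burn faster than the threshold.

    The short window confirms the problem is still happening, which helps
    avoid paging for an incident that has already recovered.
    """
    return (
        burn_rate(long_window_error_rate, slo_target) >= threshold
        and burn_rate(short_window_error_rate, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical error rates against a 99.9% SLO.
    print(should_page(0.02, 0.03, 0.999))   # both windows burning fast
    print(should_page(0.02, 0.001, 0.999))  # short window has recovered
```

Requiring both windows trades a little detection delay for fewer pages, which ties into the alert fatigue topic listed above.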

Module 7 – Incident Management and Postmortems

  1. Incident lifecycle management
  2. Incident response workflows
  3. Communication during incidents
  4. Blameless postmortems
  5. Root cause analysis (RCA)

Module 8 – Continuous Improvement and Reliability Culture

  1. Reliability maturity models
  2. Automation and toil reduction
  3. Chaos engineering basics
  4. Continuous improvement practices
  5. Building a reliability-first culture