How to Reduce MTTR with PagerDuty and Relay


DevOps and SRE teams are under intense pressure to reduce mean time to recovery (MTTR). With the proliferation of cloud services and the increasing complexity of DevOps toolchains, engineers today need not only to learn how to use these services but also to troubleshoot them when an incident is raised at 2 AM. Incident response is still largely manual: teams cobble together runbooks and ad hoc scripts and orchestrate people to respond. This “digital duct tape” approach results in what we call the “DevOps Dumping Ground”, which ultimately extends MTTR.

How PagerDuty & Relay Work Together

PagerDuty is the industry-leading incident management platform that provides reliable notifications, automatic escalations, on-call scheduling, and other functionality to help teams detect and fix infrastructure problems quickly.

Relay by Puppet is an event-driven automation platform that pulls together all the tools and technologies DevOps engineers need to effectively manage a cloud environment. Unlike many existing workflow automation tools, Relay can intelligently respond to external signals by combining event-based triggers with a powerful workflow engine in a single platform.

The latest integration between Relay and PagerDuty replaces the “digital duct tape” with reusable, event-driven workflows that close the loop on incidents faster. PagerDuty users can now:

  • Enrich alert data: Using the new Change Events launched at PagerDuty Summit, Relay enhances alerts with diagnostic information to speed time-to-resolution by presenting more context around the alert.
  • Automate incident communication: Whether it’s creating a Slack room, updating a Jira ticket, or notifying team members, Relay ensures that communication is timely and up to date.
  • Trigger auto-remediation workflows: Raising a PagerDuty incident can initiate Relay workflow runs that troubleshoot and remediate common problems securely and quickly.

Example: How to Automate Incident Communication Plans

A key way to reduce MTTR is to formalize an incident communication plan. A robust plan that defines roles and opens communication channels early shortens incident response time. Relay can automate this workflow for you by sending the on-call engineer a message with the details of the incident.

Relay uses “triggers” and “steps” to automate a set of actions. Triggers are based on cloud events, git events, monitoring alerts, tickets, and incidents. Steps are reusable, modular, and composable: things like getting a user’s info, sending Slack and Twilio messages, and using the PagerDuty Events API to provide more information on an incident. In the example below, a PagerDuty incident triggers an incident response workflow built from these steps.

When a new PagerDuty incident is raised, Relay looks up the on-call person’s email address, identifies that user in Jira and Slack, and creates a Jira ticket for the production incident. Relay then creates a Slack room to serve as the production incident command center, invites the on-call engineer and the relevant engineering manager, and sets the room topic to a link to the new Jira ticket. Finally, it posts a message to the Slack room that sets out the expectations of the production incident policy.
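The sequence above can be sketched as a trigger plus an ordered list of steps that share a context. This is only an illustration in Python, not Relay's workflow format or API: the Workflow class, the trigger name, the step functions, and all returned values (email addresses, ticket key, room name) are hypothetical placeholders standing in for real PagerDuty, Jira, and Slack calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A step reads the shared context and returns new keys to merge into it.
Step = Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class Workflow:
    """Hypothetical stand-in for an event-triggered workflow (not Relay's API)."""
    trigger: str
    steps: List[Step] = field(default_factory=list)

    def run(self, event: Dict[str, str]) -> Dict[str, str]:
        context = dict(event)  # copy so the incoming event is not mutated
        for step in self.steps:
            context.update(step(context))
        return context


def lookup_on_call(ctx: Dict[str, str]) -> Dict[str, str]:
    # Placeholder: a real step would ask PagerDuty who is on call.
    return {"on_call_email": "oncall@example.com"}


def create_jira_ticket(ctx: Dict[str, str]) -> Dict[str, str]:
    # Placeholder: a real step would create the ticket and return its key.
    return {"jira_ticket": "PROD-1"}


def create_slack_room(ctx: Dict[str, str]) -> Dict[str, str]:
    # Placeholder: a real step would create the room and invite both users.
    room = "incident-" + ctx["incident_id"].lower()
    invited = ctx["on_call_email"] + ",manager@example.com"
    return {"slack_room": room, "invited": invited}


def set_room_topic(ctx: Dict[str, str]) -> Dict[str, str]:
    return {"slack_topic": "Jira ticket: " + ctx["jira_ticket"]}


def post_policy_note(ctx: Dict[str, str]) -> Dict[str, str]:
    return {"slack_message": "Production incident policy applies to " + ctx["jira_ticket"]}


workflow = Workflow(
    trigger="pagerduty.incident.triggered",  # hypothetical trigger name
    steps=[lookup_on_call, create_jira_ticket, create_slack_room,
           set_room_topic, post_policy_note],
)

result = workflow.run({"incident_id": "P123ABC", "incident_title": "Checkout latency"})
print(result["slack_room"], "|", result["slack_topic"])
```

Because each step only reads from and adds to the shared context, steps can be reordered or reused in other workflows as long as the keys they depend on have already been set.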

Using PagerDuty’s exciting new Change Events, Relay enriches the alert with additional context from the incident. This enables the individual on call to respond to the incident quickly, with less toil spent on creating tickets and communicating what triggered the workflow.
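To make the idea of a change event concrete, here is a minimal Python sketch that builds the JSON body for one. The field names (routing_key, payload.summary, payload.source, payload.timestamp, payload.custom_details) and the endpoint URL reflect our understanding of PagerDuty's Events API v2 and should be checked against the official documentation. The build_change_event helper, the routing key, and the sample values are made up for illustration, and the sketch deliberately stops short of sending the request.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Assumed endpoint for change events in the Events API v2; verify before use.
CHANGE_EVENTS_URL = "https://events.pagerduty.com/v2/change/enqueue"


def build_change_event(
    routing_key: str,
    summary: str,
    source: str,
    custom_details: Optional[Dict[str, Any]] = None,
    timestamp: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build (but do not send) a change event body. Hypothetical helper."""
    ts = timestamp or datetime.now(timezone.utc)
    return {
        "routing_key": routing_key,  # integration key of the target service
        "payload": {
            "summary": summary,
            "source": source,
            "timestamp": ts.isoformat(),
            "custom_details": custom_details or {},
        },
    }


# Example values are placeholders, not real keys or tickets.
event = build_change_event(
    routing_key="R0UTINGKEY",
    summary="Relay workflow opened PROD-1 for checkout latency",
    source="relay-workflow",
    custom_details={"jira_ticket": "PROD-1", "slack_room": "incident-p123abc"},
    timestamp=datetime(2020, 10, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))
```

Sending the event would be a single HTTP POST of this body, as JSON, to the URL above; that call is left out so the example runs without credentials.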

Try out this workflow here.

Relay PagerDuty Production Incident Policy Workflow

Customize your Incident Response

There are several starter workflows available for PagerDuty users, which you can find on the PagerDuty integration page. You can use these workflows to create an issue in Jira, send a message to Slack, and send a Twilio SMS automatically when a PagerDuty incident is triggered.

Everyone’s workflow is a little different, so Relay workflows can be customized to fit your use case. Relay provides contextual help in its sidebar, where you can browse the library of integrations and steps while you customize your workflow.

Relay Workflow Authoring Library

Sign Up for Relay!

Use Relay with PagerDuty to reduce your incident response time and improve observability. Reducing your mean time to recovery (MTTR) is key to successful DevOps management, and event-driven automation shortens your incident response time. Relay makes this easier with workflows that fix the common, well-understood problems your teams have already identified. You can sign up for our free beta today!