VictorOps and Relay for Incident Response


VictorOps is an incident response tool whose mission is straightforward: “To make being on call suck less.” It enables teams to quickly detect and respond to problems such as service degradation or an outage. VictorOps supports a wide range of external integrations that extend its capabilities by connecting different parts of your DevOps toolchain. Linking in both Pingdom alert data and GitHub commits, for example, can surface correlations between a recent deployment and a slowdown in user-facing response times.

Relay helps VictorOps respond to incidents with sophisticated multi-step workflows. Frequently, a response policy may involve a combination of notifications and setting up communication channels for the response team to coordinate activities. The linkages between the initial VictorOps incident, a Slack channel to troubleshoot the issue in real time, and Jira tickets to track after-action reports can be tricky to maintain. With Relay, the initial incident escalation just needs to emit a webhook event and the workflow does the rest.

This blog post walks through that exact scenario: we’ll configure a webhook that fires as part of a VictorOps incident escalation policy, triggering a Relay workflow that will set up the communication channels and update the incident timeline.

Workflow Setup

Before you run this workflow, you will need a free Relay account, as well as a VictorOps account with administrator privileges so you can add the webhook and modify the escalation policy. You can visit the workflow’s homepage on relay.sh to get started, or click this link to add it to your Relay account.

Once the workflow’s active, it’ll prompt you to add the following connections:

  • A Jira account.
  • A Slack workspace bot with the following permissions:

    • channels:manage to create the channel and set the topic
    • chat:write to send messages
    • chat:write.public to send messages to channels without joining
    • chat:write.customize to send messages as a customized username and avatar

You’ll also need to enable the REST integration on VictorOps and add the generated endpoint URL as a workflow Secret named endpointURL.
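To give a feel for what posting to that endpoint involves, here is a minimal Python sketch rather than the workflow's actual implementation. It assumes the endpoint accepts a JSON body with the REST integration's `message_type`, `entity_id`, and `state_message` fields, and that reusing the incident's entity ID is what ties the update to its timeline. The function names and the choice of the `INFO` message type are illustrative assumptions.

```python
import json
import urllib.request


def build_timeline_update(entity_id: str, message: str) -> dict:
    """Build a JSON payload for the REST endpoint (field names are assumed)."""
    return {
        "message_type": "INFO",      # assumption: informational, not a new page
        "entity_id": entity_id,      # assumption: same ID as the original incident
        "state_message": message,
    }


def post_timeline_update(endpoint_url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In a workflow, the value stored in the endpointURL secret would be passed as `endpoint_url`; keeping it in a secret means the URL, which embeds an API key, stays out of the workflow definition.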

You may need to update some of the default parameters or connection information in this workflow to run in your environment. The default configuration assumes:

  • Your Jira connection is called my-jira-account
  • Your Slack connection is called my-slack-account
  • Your Jira project key is RLY
  • Your incident Slack channels will be named #team-relay-production-incident-<incidentID>
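The channel naming convention above can be sketched in a few lines of Python. This is an illustration, not the workflow's own code: the helper name `incident_channel_name` is hypothetical, and the sanitizing step reflects Slack's rule that channel names are lowercase, limited to letters, digits, hyphens, and underscores, and at most 80 characters long.

```python
import re

CHANNEL_PREFIX = "team-relay-production-incident-"
MAX_CHANNEL_LENGTH = 80  # Slack's limit on channel name length


def incident_channel_name(incident_id: str) -> str:
    """Return a Slack-safe channel name (without the leading '#')."""
    # Lowercase the ID and collapse any disallowed characters into hyphens.
    slug = re.sub(r"[^a-z0-9_-]+", "-", str(incident_id).lower()).strip("-")
    if not slug:
        raise ValueError("incident ID has no usable characters")
    return (CHANNEL_PREFIX + slug)[:MAX_CHANNEL_LENGTH]
```

If your team uses a different prefix, changing `CHANNEL_PREFIX` is the only adjustment this sketch needs.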

Once the workflow is set up in Relay, it should look something like this:

Screenshot of Relay web app showing Connections and Secrets

VictorOps Setup

Copy the webhook URL from the sidebar in Relay and, in VictorOps, go to Integrations and enable the Webhooks integration. Add a new webhook, give it a memorable name, and paste the Relay webhook URL into the dialog.

You’ll then need to associate the webhook name with one or more Escalation Policies, so the workflow will be triggered upon incident creation. Updates from Relay will automatically be associated with the timeline of the VictorOps incident that triggered them. In this example escalation policy, I page the primary on-call person and send the webhook to Relay at the same time. (“Being in the Crow’s Nest” is our name for being on call for the production Relay service!)

Screenshot of VictorOps escalation policy showing paging and webhook as two actions

Pulling it all together

When I create an incident, VictorOps sends the paging notification and the workflow takes care of the response. The workflow creates a new Jira ticket and a dedicated Slack channel, messages our #general channel to let everyone know something’s up, and populates the response Slack channel with the context provided from the initial incident creation.
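The sequence of steps described above can be sketched in Python as follows. This is only an outline of the ordering, not how Relay executes the workflow: the `Incident` type and the `create_ticket`, `create_channel`, and `send_message` callables are hypothetical stand-ins for the Jira and Slack steps, and the message wording is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    incident_id: str
    summary: str


def respond_to_incident(
    incident: Incident,
    create_ticket: Callable[[str], str],        # returns the new ticket key
    create_channel: Callable[[str], str],       # returns the channel name
    send_message: Callable[[str, str], None],   # (channel, text)
) -> dict:
    """Run the response steps in order and return what was created."""
    ticket_key = create_ticket(incident.summary)
    channel = create_channel(
        f"team-relay-production-incident-{incident.incident_id}"
    )
    # Let everyone know something is up, then seed the new channel with context.
    send_message("general", f"Incident {incident.incident_id} opened, join #{channel}")
    send_message(channel, f"{incident.summary} (tracked in {ticket_key})")
    return {"ticket": ticket_key, "channel": channel}
```

Passing the steps in as callables keeps the ordering easy to read and lets the sequence be exercised with fakes before it touches a real Jira project or Slack workspace.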

Screenshot of Slack channel created for the incident

Back in VictorOps, the incident timeline gets updated with this new context so response coordination can continue.

Screenshot of VictorOps incident timeline

Next steps

This example shows how Relay can extend VictorOps’ response capabilities, but it’s only the start. When there are known remediations for particular kinds of incidents, it’d be awesome to attempt to fix them with a workflow and post the results back, potentially resolving the incident automatically if the fix works! Fixing disk space alerts by rotating logs or adding more storage, restarting services on out-of-memory errors, or rolling back a deployment could all be done via Relay workflows. Tighter integration with VictorOps’ own Slack bot could provide the next level of interactivity for teams, and we’d love to see what else people come up with.

Give the workflow a try and let us know how it goes!