Traditional Runbooks can become 10x more useful if they are automated, or at least made executable (partly, if not fully). Shreyash Naithani, from the Microsoft Azure SRE team and author of "Practical Site Reliability Engineering", talks about how to take advantage of runbooks to eliminate toil.

A Runbook is a predefined set of steps or procedures that is usually executed manually by a systems engineer. For instance, say you want to upgrade an application in production and you have a documented set of steps for doing so. We call this a runbook. It contains procedures to start, stop, supervise, and debug the system.

Recent research shows that 80% of the time spent by engineering teams goes into triaging incidents. Over the past few years, due to the shift to a microservice world, everyone has experienced an exponential increase in code-base complexity. Managing and monitoring several microservice endpoints means a large number of checkpoints and alerts. As a result, we end up with too many incidents during outages, and engineering teams get buried in operational work. This is where Automated and Executable Runbooks come to the rescue of engineering teams, by setting up auto-mitigation or remediation on top of these incidents.
These runbooks can be triggered on the basis of events or logs, and should create incidents only if the problem needs further attention from the engineering team. Broadly speaking, Runbooks can be categorised as:

  1. Procedural Runbooks: Procedural Runbooks are manual runbooks where you just follow the technical documents and run the steps. Here, a systems engineer uses standard tools to access production systems and follows the procedure manually.
  2. Executable Runbooks: Executable Runbooks are similar to Procedural Runbooks in that systems engineers follow the procedure as described, but they can additionally run an automation task from their own machine (a shell script, a PowerShell script or any other script) against a target system to fix the problem.
  3. Automated Runbooks: As the name suggests, Automated Runbooks run automatically without any manual interaction.
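
The idea that an automated runbook should create an incident only when the problem still needs human attention can be sketched in a few lines of Python. This is a minimal illustration, not a real integration: `Event`, `handle_event`, and the `remediate` and `create_incident` callables are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    """A monitoring event that may trigger a runbook (hypothetical shape)."""
    source: str
    message: str


def handle_event(
    event: Event,
    remediate: Callable[[Event], bool],
    create_incident: Callable[[Event, str], None],
) -> str:
    """Try the automated fix first; open an incident only if it does not help."""
    try:
        fixed = remediate(event)
    except Exception as exc:
        # A runbook that crashes also needs attention from an engineer.
        create_incident(event, f"runbook raised: {exc}")
        return "escalated"
    if fixed:
        return "auto-mitigated"
    create_incident(event, "runbook ran but the problem persists")
    return "escalated"
```

In practice `remediate` would wrap the actual fix (for example a service restart) and `create_incident` would call whatever incident tool the team uses.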

    This blog talks about Automated Runbooks and a few automation tools.

    Automated Runbooks allow us to automate time-consuming and repetitive tasks. Using these, we can automate any tasks on one or more servers.

    Listed below are a few instances where automated runbooks can potentially save the day:

    1. Active Directory:
      We can use automated runbooks to update Active Directory whenever a new user is onboarded onto the system. Using these runbooks we can create a user account, add the user to multiple groups so that they have the appropriate permissions, and allot a computer that is joined to the organisational domain. Similarly, there could be many additional activities needed when a new user is onboarded. With automated runbooks we can automate these manual tasks and help the user onboard quickly.
    2. Virtual Machine/Service Management:
      We can use automated runbooks to manage our Virtual Machines (VMs) or services, in scenarios such as:
      * Restarting VMs after patching
      * Checking the status of a service
      * Restarting services running in VMs after deployments

      Also, when we see that VMs are in a hung state or are not serving any traffic or requests, we can create a quick-fix runbook that we run against the active incident to mitigate it.

    3. Log Archive:
      Another use case is to automate log management by creating runbooks that either delete your old data or archive it into Azure log tables. Later you can use these Azure log tables to analyze the data and identify themes in it, for example which types of errors our WebApps servers encountered in the last 30 days. By looking into that data you can improve the reliability of the product.
    4. Monitoring:
      Another use case scenario would be monitoring. Using runbooks, we can monitor computer responsiveness. Is the host available on the network? How much disk space is left on the machine? How is the health of the daemon or services? What is the resource utilization for servers? By using any scripting language we can fetch these details and update them on incidents or start our investigations.
    5. Configurations Management:
      Deploying standard baseline configurations can be done using runbooks. These configurations could relate to services, clients, network equipment or even mobile devices. This way we can meet a minimum security standard as required by the organizational security policy. We can also deploy OS and app configuration using runbooks. And if any software or patch needs to be deployed, we can achieve that using runbook automation.
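
As a rough illustration of the monitoring use case above, here is a minimal Python sketch that answers two of those questions: is a host reachable on the network, and how much disk space is left? It uses only the standard library. The function names and the shape of the report are assumptions made for this example, and the result would still need to be attached to an incident by whatever tooling you use.

```python
import shutil
import socket


def disk_free_percent(path: str = "/") -> float:
    """Return the percentage of free disk space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total * 100


def host_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def collect_health(path: str = "/") -> dict:
    """Gather a small health report for the local machine."""
    return {
        "hostname": socket.gethostname(),
        "disk_free_percent": round(disk_free_percent(path), 1),
    }
```

A scheduled runbook could call `collect_health()` on each server and raise an alert when `disk_free_percent` drops below a limit the team chooses.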

    Here are a few runbook automation tools that we can use for the above:


    Azure Automation:

    Azure Automation is Microsoft's cloud-hosted automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources. Runbooks can be written as PowerShell, PowerShell Workflow, graphical, or Python runbooks. We can trigger these runbooks from Azure Alerts, webhooks, schedules, Logic Apps, another runbook or watcher tasks.

    In this example, we have created a PowerShell runbook to restart the WebApps servers. We can run this runbook on a schedule or trigger it from a webhook, as required.

    # Authenticate using the Automation account's Run As service principal
    $connectionName = "AzureRunAsConnection"
    $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName
    $null = Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint

    # Select the subscription that hosts the web app ('Subs-ID' is a placeholder)
    $null = Select-AzureRmSubscription -SubscriptionId 'Subs-ID'

    # Restart the web app
    Restart-AzureRmWebApp -ResourceGroupName "SquadCast" -Name "SquadCastWebApps"
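
Since a runbook like this can be started from a webhook, the calling side can be sketched as a plain HTTP POST. The snippet below is a minimal illustration in Python using only the standard library. The function names and the payload key are hypothetical, and the webhook URL is the secret URL you receive when you create the webhook, so no real URL is shown here.

```python
import json
import urllib.request


def build_webhook_request(webhook_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_runbook(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """Send the webhook request and return the HTTP status code."""
    request = build_webhook_request(webhook_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `trigger_runbook(url, {"WebAppName": "SquadCastWebApps"})` would pass the web app name in the request body. How the runbook reads that body depends on how the runbook itself is written.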
    

    Rundeck:

    Rundeck is a web-accessible console for dispatching commands and scripts to your nodes. It can also be used for deployments, ops tasks and so on. Within Rundeck you can easily create Jobs (which can be triggered by the scheduler or on demand) and dispatch scripts or user-defined commands to selected nodes. In short, using Rundeck you can automate routine or ad hoc tasks by creating runbooks.

    Rundeck Features:

    1. Rundeck supports multi-step workflows.
    2. It supports distributed command execution.
    3. Jobs can be executed on demand or on a schedule.
    4. Rundeck provides a graphical web console for running commands and jobs.
    5. It offers a command-line interface and a Web API so that it can be operated from code.
    6. It logs the execution history of all commands and jobs for audit purposes.

    Rundeck can integrate with other tools in several ways:
    - Rundeck Plugin: A Rundeck plugin implemented in Java or shell script, installed into a Rundeck server.
    - External Service: An external service that is used by a Rundeck plugin or the Rundeck core.
    - External Plugin: A plugin installed in another tool that interacts with Rundeck through its API.
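
To give a feel for operating Rundeck from code, here is a minimal Python sketch that prepares a request to run a job through the Web API. It assumes the job-run endpoint `/api/<version>/job/<id>/run` and token authentication through the `X-Rundeck-Auth-Token` header. The server URL, job ID, token, API version and option names are placeholders that depend on your installation, and the function name is hypothetical.

```python
import json
import urllib.request


def build_run_job_request(base_url, job_id, token, api_version, options=None):
    """Build (but do not send) a request that asks Rundeck to run a job."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    body = json.dumps({"options": options or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "X-Rundeck-Auth-Token": token,  # API token created in Rundeck
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )
```

Passing the resulting request to `urllib.request.urlopen` would start the job. Keeping the request construction separate makes it easy to inspect or test without a Rundeck server.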


    Ansible:

    Ansible is a very powerful open source configuration management tool. Ansible uses 'Playbooks' to deploy, manage and configure anything from a single server to multi-server environments. A Playbook is similar to a runbook: it is where you define a set of procedures.

    Ansible Features:

    1. Agentless: There is no need to install any software or client/agent on the nodes you manage, unlike Puppet or Chef.
    2. Python Supported: Ansible is built on Python and provides a lot of Python features and modules. It needs Python on the control node, and most modules also need a Python interpreter on the managed servers.
    3. Secure SSH: Ansible uses secure shell (SSH) to connect to the servers for any operation. SSH supports key-based, passwordless authentication, which keeps Ansible lightweight and secure.
    4. Push Architecture: Ansible follows a push-based architecture for deploying configuration. Whenever you want to push a configuration change, just update the playbook and run it, and Ansible takes care of the rest. In short, a central server manages all the configuration and pushes it to the target servers.

    Ansible Playbooks are written in YAML and declaratively define your configuration. Let's look at an example playbook that installs Nginx using Ansible.

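    - name: Install nginx
      hosts: host.name.ip    # placeholder: replace with your host or inventory group
      become: true           # run the tasks with elevated privileges

      tasks:
      # nginx is packaged in the EPEL repository on RHEL/CentOS
      - name: Add epel-release repo
        yum:
          name: epel-release
          state: present

      - name: Install nginx
        yum:
          name: nginx
          state: present

      # Render index.html from the playbook's templates and copy it to the web root
      - name: Insert Index Page
        template:
          src: index.html
          dest: /usr/share/nginx/html/index.html

      - name: Start nginx
        service:
          name: nginx
          state: started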
    
    


    Squadcast:

    Squadcast Runbooks let you level up your Incident Management with a next-generation Reliability Orchestration Engine based on Site Reliability Engineering (SRE). It is designed to host and execute runbook automation in response to operational events or incidents. By using Squadcast Runbooks you can remove toil and repetitive tasks from your system. We have already seen how to create a runbook using Azure Automation; let's see how easily we can create one using Squadcast.

    Let's take a scenario: your Squadcast dashboard is alerting that your WebApps servers are using more resources than usual, perhaps due to high CPU or high traffic. As part of auto-mitigation, we would like automation that checks whether resource utilization on the WebApps servers has crossed a certain threshold, say 65%, and if so adds more WebApps servers to handle the requests for the time being. To achieve this, we can write a simple runbook using Squadcast and execute it from incident tickets, either manually or automatically using a scheduler.
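
The decision at the heart of that scenario can be sketched as a small Python function. This is only an illustration of the threshold logic, not Squadcast's API: the function name, the `max_servers` cap and the assumption that load spreads evenly across servers are all introduced here for the example.

```python
import math


def servers_to_add(
    utilization_percent: float,
    current_servers: int,
    threshold_percent: float = 65.0,
    max_servers: int = 10,
) -> int:
    """Return how many servers to add so average utilization falls to the threshold.

    Assumes the load is spread evenly across servers and never scales beyond
    max_servers (a hypothetical safety cap).
    """
    if utilization_percent <= threshold_percent:
        return 0
    # Total load expressed in "percent of one server" units.
    total_load = utilization_percent * current_servers
    needed = math.ceil(total_load / threshold_percent)
    needed = min(needed, max_servers)
    return max(needed - current_servers, 0)
```

With 4 servers at 80% utilization, the total load is 320 units, so 5 servers are needed to stay at or below 65% and the function returns 1. The step that actually provisions the server would depend on your cloud provider and is left out.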

    Runbooks Support:
    Currently, Squadcast Runbooks support the following languages:

    • Shell script
    • Lua script
    • Python3 script
    • NodeJS script
    • Ansible configuration

    Here are some best practices for runbooks:

    1. Know your Application:
      Within our application, we need to consider which processes need improvement. Once we have identified processes that could benefit from runbook automation, we can start gathering requirements.
    2. Gather Requirements:
      While gathering requirements, we should focus on determining the input and output values for our runbook, and whether the inputs are supplied automatically or the user needs to enter them.
    3. Use of Integration Packs:
      Integration packs give us additional runbook activities. For example, if you want to automate user onboarding and that includes working with an Active Directory user account, you will need to register and deploy the integration pack for Active Directory.
    4. Single or Multiple Host Runbook:
      We need to know whether our automation runbook will run on a single host or on multiple hosts at the same time, because the runbook has to be designed around that decision.
    5. Runbook Execution Trigger:
      We should know how we are going to execute the runbook. Will it run on a schedule? Will it be run manually from time to time? Will it need any kind of user interaction?
    6. Runbook Logs:
      We should also decide which logs will be needed once the runbook has executed, and where we are going to save them for future reference or debugging.
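
Several of these practices (explicit inputs and outputs, single or multiple hosts, and logging) can be seen together in a small runbook skeleton. The Python sketch below is only one possible shape. The argument names, the `action` callable and the result format are assumptions made for this example, and the real work of the runbook is passed in rather than shown.

```python
import argparse
import logging


def parse_inputs(argv=None) -> argparse.Namespace:
    """Declare the runbook's inputs explicitly so callers know what to supply."""
    parser = argparse.ArgumentParser(description="Restart a service on one or more hosts")
    parser.add_argument("--hosts", nargs="+", required=True, help="target hosts")
    parser.add_argument("--service", required=True, help="service to restart")
    parser.add_argument("--dry-run", action="store_true", help="log actions without running them")
    return parser.parse_args(argv)


def run(args: argparse.Namespace, action) -> dict:
    """Apply action(host, service) to every host and return a per-host result."""
    log = logging.getLogger("runbook")
    results = {}
    for host in args.hosts:
        if args.dry_run:
            log.info("dry run: would restart %s on %s", args.service, host)
            results[host] = "skipped"
            continue
        try:
            action(host, args.service)
            results[host] = "ok"
        except Exception:
            # Keep going on other hosts, but record the failure in the log.
            log.exception("restart of %s failed on %s", args.service, host)
            results[host] = "failed"
    return results
```

Where the log ends up (a file, a central log store or the incident itself) is decided by how the `runbook` logger is configured before `run` is called.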

    Conclusion

    With the right amount of automation and strategic process management, you can improve incident remediation instructions and ensure runbooks are updated in a timely manner. This ensures that when the next incident occurs, the documentation is up to date and available to the right person at the right time.

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.