Crying Cloud

Automating OMS Searches, Schedules, and Alerts in Azure Resource Manager Templates


Recently, I had an opportunity to produce a proof of concept around leveraging Operations Management Suite (OMS) as a centralized point of monitoring, analytics, and alerting.  As in any good Azure project, codifying your Azure resources in an Azure Resource Manager template should be table stakes. While the official ARM schema doesn't yet have all the children for the Microsoft.OperationalInsights/workspaces type, the Log Analytics documentation does provide samples for interesting use cases that demonstrate data sources and saved searches.  Notably absent from this list, however, is anything about the /schedules and /actions types. Now, a clever Azure developer will realize that an ARM template is basically an orchestration and expression wrapper around the REST APIs.  In fact, I've yet to see a case where an ARM template supports a type that's not provided by those APIs.  With that knowledge and the alert API samples, we can adapt the armclient put <url> examples to an ARM template.

The Resources

What you see as an alert in the OMS portal is actually 2 to 3 distinct resources in the ARM model:

A mandatory schedules type, which defines the period and time window for the alert

{
  "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'), '/', variables('Schedule-Name'))]",
  "type": "Microsoft.OperationalInsights/workspaces/savedSearches/schedules",
  "apiVersion": "2015-03-20",
  "location": "[parameters('Workspace-Location')]",
  "dependsOn": [
    "[variables('Search-Id')]"
  ],
  "properties": {
    "Interval": 5,
    "QueryTimeSpan": 5,
    "Active": "true"
  }
}

A mandatory actions type, defining the trigger and email notification settings

{
  "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'), '/', variables('Schedule-Name'), '/', variables('Alert-Name'))]",
  "type": "Microsoft.OperationalInsights/workspaces/savedSearches/schedules/actions",
  "apiVersion": "2015-03-20",
  "location": "[parameters('Workspace-Location')]",
  "dependsOn": [
    "[variables('Schedule-Id')]"
  ],
  "properties": {
    "Type": "Alert",
    "Name": "[parameters('Search-DisplayName')]",
    "Threshold": {
      "Operator": "gt",
      "Value": 0
    },
    "Version": 1
  }
}

An optional, secondary, actions type, defining the alert's webhook settings

{
  "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'), '/', variables('Schedule-Name'), '/', variables('Action-Name'))]",
  "type": "Microsoft.OperationalInsights/workspaces/savedSearches/schedules/actions",
  "apiVersion": "2015-03-20",
  "location": "[parameters('Workspace-Location')]",
  "dependsOn": [
    "[variables('Schedule-Id')]"
  ],
  "properties": {
    "Type": "Webhook",
    "Name": "[variables('Action-Name')]",
    "WebhookUri": "[parameters('Action-Uri')]",
    "CustomPayload": "[parameters('Action-Payload')]",
    "Version": 1
  }
}

The Caveats

The full template adds a few eccentricities that are necessary for fully functioning resources.

First, let's talk about the search.  At the OMS interface, when you create a search, it is named <Category>|<Name> behind the scenes.  It is also lower-cased.  Now, the APIs will not lower-case any name you send in, but if you use the interface to update that API-deployed search, you will duplicate it rather than being prompted to overwrite.  This tells us that, for whatever reason, OMS' identifiers are case-sensitive.  This is why the template accepts a search category and name as parameters, then uses variables to lower-case them and match the interface's naming convention.
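The naming convention is easy to reproduce outside the template, for example when you need to predict the name a deployment will produce. Here is a minimal Python sketch of the same rule; the function name `oms_search_name` and the sample category and display name are made up for illustration, and Python's `lower()` is assumed to match ARM's `toLower()` for ordinary ASCII names.

```python
def oms_search_name(category: str, display_name: str) -> str:
    """Mirror the template's toLower(concat(category, '|', name)) expression."""
    return f"{category}|{display_name}".lower()


# Hypothetical category and display name, for illustration only.
print(oms_search_name("Security", "Failed Logons"))
# prints: security|failed logons
```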

Second, a schedule's name must be unique across your entire workspace, though the actions' names need not be.  Again, the OMS UI will take over here and generate GUIDs for the resources behind the scenes.  Unfortunately for us, ARM templates don't support random number or GUID generation - a necessary evil of being deterministic.  While the template could accept a GUID by parameter, I find that asking my users for GUIDs is impolite - a uniquestring() function on the already-unique search name should suffice for most use cases.  Since the actions don't require uniqueness, static names will do.

Lastly, this template is not idempotent.  Idempotency is a characteristic of a desired state technology that focuses on ensuring the state (in our case, the template), rather than prescribing process (create this, update that, delete the other thing).  Each of Azure's types is responsible for implementing idempotency as much as possible.  For some resources like the database extensions type, which imports .bacpac files, the entire type opts out of idempotency (once data is in the database, no more importing).  For others (such as Azure VMs), only changes in certain properties like administrator name/password break idempotency.  In the OMS search and schedule's case, if either one exists when you deploy the template, the deployment will fail with a code of BadRequest.  To add insult to injury, deleting the saved search from the interface does not delete the schedule.  Neither does deleting the alert from the UI - it simply flips the schedule to "Enabled": false.  There is, in fact, no way to delete the schedule via the UI - you must do so via the REST API, similar to the following:

armclient delete /{Schedule ResourceID}?api-version=2015-03-20

This means that, for now, any versioned solution which desires to deploy ARM template updates to its searches and alerts will need to either:

  1. Delete the existing search/schedule, then redeploy OR
  2. Deploy the updated version side-by-side (named differently), then clean up the prior version's items.
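For the first option, the fiddly part is assembling the schedule's resource ID, since the search name contains a pipe and may contain spaces. The Python sketch below builds the path you would hand to armclient delete. The helper name and every sample value (subscription, resource group, workspace, schedule name) are placeholders, and percent-encoding each segment is my assumption about how the URL should be formed rather than something the API documentation spells out.

```python
from urllib.parse import quote

API_VERSION = "2015-03-20"  # same API version the template uses


def schedule_request_path(subscription_id, resource_group, workspace,
                          search_name, schedule_name):
    """Build the resource path (plus api-version) for a schedule."""
    segments = [
        "subscriptions", subscription_id,
        "resourceGroups", resource_group,
        "providers", "Microsoft.OperationalInsights",
        "workspaces", workspace,
        "savedSearches", search_name,
        "schedules", schedule_name,
    ]
    # Assumption: encode each segment so '|' and spaces survive the URL.
    path = "/" + "/".join(quote(segment, safe="") for segment in segments)
    return f"{path}?api-version={API_VERSION}"


# Placeholder values; the real schedule name is the uniquestring() output.
print(schedule_request_path(
    "00000000-0000-0000-0000-000000000000",
    "my-rg",
    "my-workspace",
    "security|failed logons",
    "abc123",
))
```

The printed path is what follows armclient delete on the command line. If your shell or armclient already encodes the URL for you, pass the raw names instead to avoid double encoding.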

Luckily ours was a POC, so deleting and re-creating was very much an option.  I look forward to days when search and alerting types become first-class, idempotent citizens.  Until then, I hope this approach proves useful!

The Template

The template below deploys a search, a schedule, a threshold action (the alert), and a webhook action (the action) that includes a custom JSON payload (remember to escape your JSON!).  With a little work, this template could be reworked to segregate the alert-specific components from the search so that multiple alerts could be deployed to the same search or adapted for use with Azure Automation Runbooks.  Enjoy!

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "Action-Payload": {
      "type": "string"
    },
    "Action-Uri": {
      "type": "string"
    },
    "Alert-DisplayName": {
      "type": "string"
    },
    "Search-Category": {
      "type": "string"
    },
    "Search-DisplayName": {
      "type": "string"
    },
    "Search-Query": {
      "type": "string"
    },
    "Workspace-Name": {
      "type": "string"
    },
    "Workspace-Location": {
      "type": "string"
    }
  },
  "variables": {
    "Action-Name": "webhook",
    "Alert-Name": "alert",
    "Schedule-Name": "[uniquestring(variables('Search-Name'))]",
    "Search-Name": "[toLower(concat(parameters('Search-Category'), '|', parameters('Search-DisplayName')))]",
    "Search-Id": "[resourceId('Microsoft.OperationalInsights/workspaces/savedSearches', parameters('Workspace-Name'), variables('Search-Name'))]",
    "Schedule-Id": "[resourceId('Microsoft.OperationalInsights/workspaces/savedSearches/schedules', parameters('Workspace-Name'), variables('Search-Name'), variables('Schedule-Name'))]",
    "Alert-Id": "[resourceId('Microsoft.OperationalInsights/workspaces/savedSearches/schedules/actions', parameters('Workspace-Name'), variables('Search-Name'), variables('Schedule-Name'), variables('Alert-Name'))]",
    "Action-Id": "[resourceId('Microsoft.OperationalInsights/workspaces/savedSearches/schedules/actions', parameters('Workspace-Name'), variables('Search-Name'), variables('Schedule-Name'), variables('Action-Name'))]"
  },
  "resources": [
    {
      "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'))]",
      "type": "Microsoft.OperationalInsights/workspaces/savedSearches",
      "apiVersion": "2015-03-20",
      "location": "[parameters('Workspace-Location')]",
      "dependsOn": [],
      "properties": {
        "Category": "[parameters('Search-Category')]",
        "DisplayName": "[parameters('Search-DisplayName')]",
        "Query": "[parameters('Search-Query')]",
        "Version": "1"
      }
    },
    {
      "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'), '/', variables('Schedule-Name'))]",
      "type": "Microsoft.OperationalInsights/workspaces/savedSearches/schedules",
      "apiVersion": "2015-03-20",
      "location": "[parameters('Workspace-Location')]",
      "dependsOn": [
        "[variables('Search-Id')]"
      ],
      "properties": {
        "Interval": 5,
        "QueryTimeSpan": 5,
        "Active": "true"
      }
    },
    {
      "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'), '/', variables('Schedule-Name'), '/', variables('Alert-Name'))]",
      "type": "Microsoft.OperationalInsights/workspaces/savedSearches/schedules/actions",
      "apiVersion": "2015-03-20",
      "location": "[parameters('Workspace-Location')]",
      "dependsOn": [
        "[variables('Schedule-Id')]"
      ],
      "properties": {
        "Type": "Alert",
        "Name": "[parameters('Search-DisplayName')]",
        "Threshold": {
          "Operator": "gt",
          "Value": 0
        },
        "Version": 1
      }
    },
    {
      "name": "[concat(parameters('Workspace-Name'), '/', variables('Search-Name'), '/', variables('Schedule-Name'), '/', variables('Action-Name'))]",
      "type": "Microsoft.OperationalInsights/workspaces/savedSearches/schedules/actions",
      "apiVersion": "2015-03-20",
      "location": "[parameters('Workspace-Location')]",
      "dependsOn": [
        "[variables('Schedule-Id')]"
      ],
      "properties": {
        "Type": "Webhook",
        "Name": "[variables('Action-Name')]",
        "WebhookUri": "[parameters('Action-Uri')]",
        "CustomPayload": "[parameters('Action-Payload')]",
        "Version": 1
      }
    }
  ],
  "outputs": {}
}
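Since Action-Payload is a string parameter that itself holds JSON, hand-escaping it in a parameters file is error-prone. One way around that, sketched here in Python, is to generate the parameters file and let `json.dumps` do the escaping. Every value below (webhook URL, query, workspace name, payload fields) is a made-up placeholder, and `build_parameters_file` is a hypothetical helper, not part of any Azure tooling.

```python
import json


def build_parameters_file(values):
    """Wrap plain values in the ARM deployment parameters file layout."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


# Hypothetical webhook body; use whatever your receiver expects.
payload = {"text": "Alert fired", "severity": "high"}

parameters = build_parameters_file({
    # Serialize once here; the outer json.dumps below escapes the quotes.
    "Action-Payload": json.dumps(payload),
    "Action-Uri": "https://example.com/hooks/oms",
    "Alert-DisplayName": "Failed Logons Alert",
    "Search-Category": "Security",
    "Search-DisplayName": "Failed Logons",
    "Search-Query": "Type=Event EventLevelName=error",
    "Workspace-Name": "my-workspace",
    "Workspace-Location": "East US",
})

# Write this output to a file and pass it to your deployment tool of choice.
print(json.dumps(parameters, indent=2))
```

The payload is serialized twice on purpose: once to turn it into a string, and once more as part of the parameters file, which is what produces the escaped quotes the template expects.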

Saving money in the cloud?


One of the cloud’s big selling points is the promise of lower costs, but more often than not customers who move servers to the cloud end up paying more for the same workload.  Have we all been duped?  Is the promise a lie? Over the past several years the ACE team (the group of experts behind the AzureFieldNotes blog) has helped a number of customers on their Azure journey, many of whom were motivated by the economic benefits of moving to the cloud.  Few take the time to truly understand the business value as it applies to their unique technology estate and develop plans to achieve and measure the benefits.  Most simply assume that running workloads in the cloud will result in lower costs - the more they move, the more they will save.  As a result, management establishes a "Cloud First" initiative and IT scrambles to find workloads that are low risk, low complexity candidates.  Inevitably, these end up being existing virtual machines or physical servers which can be easily migrated to Azure.  And here is where the problems begin.

When customers view Azure as simply another datacenter (which just happens to be in the cloud) they apply their existing datacenter thinking to Azure workloads and they negate any cost benefit.  To realize the savings from cloud computing customers need to shift into consumption-based models and this goes far beyond simply migrating virtual machines to Azure.  When server instances are deployed just like those in the old datacenter and left running 24x7, the same workload will most likely end up costing more in Azure.  In addition, if instances aren't decommissioned when no longer needed it leads to sprawl, environment complexity, and costs that quickly get out of control.

Taking it a step further, customers must also consider which services should continue to be built and maintained in-house, and which should simply be consumed as a service.  These decisions will shape the technical cloud foundations for the enterprise.  Unfortunately, many of these decisions are made based on early applications deployed to Azure.  We call this the "first mover" issue.  Decisions made to support the first app in the cloud may not be the right decisions for subsequent apps or for the enterprise as a whole, leading to redundant and perhaps incompatible architecture, poor performance, higher complexity, and ultimately higher cost.  Take identity as an example:  existing identity solutions deployed in-house are often sacred cows because of the historical investment and specialized skills required to maintain the platform.  Previously, these investments were necessary because the only way to deliver this function was to build your own.  But (with limited exception) identity doesn't differentiate your core business and customers don't pay more or buy more product because of your beloved identity solution.  With the introduction of cloud-based identity, such as Azure Active Directory, companies can now choose to consume identity as a service, eliminate the complexity and specialized skills required to support in-house solutions, and focus talent and resources on higher value services which can truly differentiate the business.

Breaking it down, there are a handful of critical elements that must be addressed for any customer to realize value in the cloud:

  • Business Case:  understand what is valuable to your business, how you measure those things, and how you will achieve the value.  The answers to these questions will be different for every customer, but the need to answer them is universal.  Assuming the cloud will bring value - whether you view value as speed to market, cost reduction, evergreen, simplification, etc. - without understanding how you achieve and measure that goal is a recipe for failure.
  • Cloud Foundations:  infrastructure components that will be shared across all services need to be designed for the enterprise, not driven by the first mover.  It's not unusual for Azure environments to quickly evolve from early Proof of Concept deployments to running production workloads, but the foundations (such as subscription model, network, storage, compute, backup, security, identity, etc.) were never designed for production - you need to spend the time early to get these right or your ability to realize results from Azure will be negatively impacted.
  • Ruthless automation:  standardization and automation underpin virtually every element of the cloud's value proposition and you must embrace them to realize maximum benefit from the cloud.  This goes beyond systems admins having scripts to automate build processes (although that is a start).  It means build and configuration become part of the software development practice, including version control, testing, and design patterns.  In other words, you write code to provision and manage cloud resources and the underlying infrastructure is treated just like software:  infrastructure as code.
  • Operating Model: workloads running in the cloud are different from those in your datacenter and supporting these instances will require changes to the traditional operating model.  As you move higher into the as-a-Service stack (IaaS -> PaaS -> SaaS -> BPaaS etc.) the management layer shifts more and more to the cloud provider.  Introduce DevOps in the equation and the impact to traditional operating models is even greater.  When there is an issue, how is the root cause determined when you don't have a single party responsible for the full stack?  Who is responsible for resolution of service and how will hand-offs work between the cloud provider and your in-house support teams?  What tools are involved, what skills are required, and how is information tracked and communicated?  In the end, much of the savings from cloud can come from transformation within the operating model.
  • Governance and Controls:  If you thought keeping a handle on systems running in your datacenter was a challenge, the cloud can make it exponentially worse.  Self-service and near instantaneous access to resources is the perfect storm for introducing server sprawl without proper governance and controls.  In addition, since cloud resources aren't sitting within the datacenter where IT has full control of the entire stack, how can you be sure data is secure, systems are protected, and the company is not exposed to regulatory or legal risk?

In future posts I'll cover each one of these in more detail to help frame how you can maximize the value of Azure (and how Azure Stack can play an important role) in your cloud journey.