Crying Cloud

Azure Role-Based Access Control, Part 1


One of the advantages of the Azure Resource Manager (ARM) deployment model is being able to delegate administration, something we couldn’t do under the Azure Service Management (ASM or RDFE) deployment model (see the previous posts on subscriptions and resource groups).  By delegating administrative authority, we can keep the number of subscription admins low and grant access in accordance with the principle of least privilege.  This post walks through using Azure role-based access control (RBAC) to achieve all of this.

Azure RBAC Basics

Built-In Roles

There are three general roles that apply to all resource types:

  • Owner has full access to the in-scope resources
  • Contributor has the same access to in-scope resources as Owner but cannot manage access
  • Reader can only view in-scope resources

In addition to these general roles, there are other built-in roles that are resource-specific.  Microsoft has a list of those roles, but because Azure is changing regularly it’s best to use PowerShell to query for the roles:

[powershell]

# list all of the RBAC roles in the subscription
(Get-AzureRmRoleDefinition).Name

# see the actions a specific role is allowed to perform
(Get-AzureRmRoleDefinition -Name '<Role Name>').Actions

# see the actions a specific role is *not* allowed to perform
(Get-AzureRmRoleDefinition -Name '<Role Name>').NotActions

[/powershell]
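
To make the relationship between Actions and NotActions concrete, here is a small Python sketch (not Azure code, just a model): a role's effective permissions are its Actions minus its NotActions, with `*` as a wildcard. The role contents below are simplified assumptions based on what the built-in roles looked like when I wrote this, so check the real values with the queries above.

```python
from fnmatch import fnmatchcase

# Simplified, assumed role definitions for illustration only.
# Confirm the real Actions/NotActions with Get-AzureRmRoleDefinition.
ROLES = {
    "Owner": {"Actions": ["*"], "NotActions": []},
    "Contributor": {
        "Actions": ["*"],
        "NotActions": [
            "Microsoft.Authorization/*/Delete",
            "Microsoft.Authorization/*/Write",
            "Microsoft.Authorization/elevateAccess/Action",
        ],
    },
    "Reader": {"Actions": ["*/read"], "NotActions": []},
}


def _matches(action, patterns):
    # Compare case-insensitively; '*' in a pattern matches any run of characters.
    action = action.lower()
    return any(fnmatchcase(action, pattern.lower()) for pattern in patterns)


def is_allowed(role_name, action):
    """Effective permission = matches Actions and does not match NotActions."""
    role = ROLES[role_name]
    return _matches(action, role["Actions"]) and not _matches(action, role["NotActions"])


for role in ROLES:
    print(role, is_allowed(role, "Microsoft.Authorization/roleAssignments/write"))
```

Under these assumptions only Owner can write role assignments, which is exactly the "cannot manage access" difference between Owner and Contributor.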

Azure Active Directory (AD) users and groups can be assigned to any Azure RBAC role.  The Azure AD tenant to which the subscription belongs is the source tenant for users and groups.

Custom Roles

If you find that the built-in roles aren’t sufficient, you can create custom roles, which will be the topic of Part 2.

Resource Hierarchy

There is a hierarchy to containers and resources:

  • The subscription is the top-level container that can house resources
  • A resource group (RG) can belong to a single subscription (and can’t span subscriptions)
  • A resource can belong to a single resource group

Any access granted to a parent container is inherited by all children.  For example, an account granted read access at the subscription level can see all resources in the subscription.  By the way, there is no concept of denying access in Azure RBAC, so be very careful and deliberate about granting widespread access.
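
Inheritance follows the resource ID path: a role assigned at a scope covers that scope and everything beneath it. The Python sketch below models only that rule; the subscription ID, RG names, and VNet name are made-up placeholders.

```python
def applies_to(assignment_scope, resource_id):
    """Return True if a role assigned at assignment_scope covers resource_id."""
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # A scope covers itself and anything nested under it in the hierarchy.
    return target == scope or target.startswith(scope + "/")


# Placeholder identifiers, for illustration only.
SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"
RG = SUB + "/resourceGroups/rg-network"
OTHER_RG = SUB + "/resourceGroups/rg-network-test"
VNET = RG + "/providers/Microsoft.Network/virtualNetworks/vnet-01"

print(applies_to(SUB, VNET))      # subscription-level access reaches the VNet
print(applies_to(RG, VNET))       # so does RG-level access on its own RG
print(applies_to(RG, OTHER_RG))   # but not a sibling RG
```

Note the trailing `/` in the prefix check: without it, access to `rg-network` would wrongly appear to cover `rg-network-test`.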

ARM/New Portal vs. ASM/Classic Portal/RDFE

Back in the ASM days, to access an Azure subscription one needed to be a subscription admin (or co-admin).  Those admins are automatically granted subscription owner access in ARM.  However, accounts granted owner access in ARM are not automatically granted subscription co-admin access.  If you have resources that you need to manage that are not yet available in ARM you will still need to manage the co-admins list.  In either case the recommendation is still the same: keep the number of subscription admins (granted either through Azure RBAC or directly through subscription admins) as low as possible.

Azure RBAC in Practice

Recommendations for a resource group strategy were discussed in a previous post, so I won’t rehash that content.  What I do want to talk about is how to implement your strategy.

Creating RGs and Delegating Admin

In this model, RGs are created by an account with the subscription Owner role (a Contributor can also create an RG, but cannot assign roles on it).  Once the RG is created, the Owner role on the RG must be assigned to the user who is the actual owner of the RG (owner as in someone who is responsible for the resources that the RG will contain).  Once that owner has been established, the new owner can assign additional roles.  The advice for RGs is the same as for subscriptions: keep the number of owners as low as possible.  Not everyone who needs access to the RG needs to be an owner of the RG; use Contributor instead.

Managing Roles

To keep the administrative overhead as low as possible, use Azure AD groups to manage role membership.  Create a group and add user accounts to the group.  If you’re syncing your on-premises AD Domain Services (DS) with Azure AD, create the group in AD DS and let it sync to Azure AD.  Manage these groups using your existing on-premises user and group management process.

Infrastructure RGs

Contrary to much of the advice published on the Internet, not every service or application should get its own VNet, and not every virtual machine (VM) should get its own storage account.  VNets and storage accounts for VM disks (VHDs) are infrastructure resources and need to be managed as such.

Network

Create an RG for the VNets and place all the VNets into that RG.  Create a group, name it something descriptive like Network Consumers, and grant it the Reader and Virtual Machine Contributor roles to the VNet RG.  Any user account in that group will be able to attach VMs to any of the VNets in the VNet RG.

VHD Storage Accounts

Create an RG for the VHD storage accounts and place all the VHD storage accounts into that RG.  Create a group, name it something descriptive like Storage Consumers, and grant it the Reader and Virtual Machine Contributor roles to the VHD storage account RG.  Any user account in that group will be able to use any of the storage accounts in the RG to store a VM’s VHDs.

App Service Plan – Outbound Network Connection Limit

A few days ago I ran into a problem where our production Azure web apps were throwing the error below:

[SocketException (0x271d): An attempt was made to access a socket in a way forbidden by its access permissions x.x.x.x:80]
System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) +208
System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception) +464

We opened a case with Microsoft, and upon investigation they told us that our App Service Plan (running on Standard S1 with 2 instances) was hitting the outbound connection limit. What? How on earth were we supposed to know that? At the time of writing, these were the connection limits given by MS:

App Service Plan                  | Connection Limit
Free F1                           | 250
Shared D1                         | 250
Basic B1, 1 instance              | 1920
Basic B2, 1 instance              | 3968
Basic B3, 1 instance              | 8064
Standard S1, 1 instance           | 1920
Standard S1, 2 instances          | 1920 per instance
Standard S2, 1 instance           | 3968
Standard S3, 1 instance           | 8064
Premium P1, 1 instance (Preview)  | 1920

On further request, MS gave us a table of the apps under the App Service Plan and their open socket counts. It clearly shows that one App1 worker process (w3wp.exe, with 1870 open sockets) is not reusing its connections and keeps creating new ones until the overall limit of the App Service Plan is hit.

WebApp Name | Process Name         | Open Socket Count
App2        | <WebJob>.WebJob.exe  | 4
App2        | <WebJob>.WebJob.exe  | 4
App2        | w3wp.exe             | 2
App1        | <WebJob>.WebJob.exe  | 8
App1        | <WebJob>.WebJob.exe  | 4
App1        | <WebJob>.WebJob.exe  | 6
App1        | w3wp.exe             | 2
App1        | w3wp.exe             | 1870
App1        | <WebJob>.WebJob.exe  | 6
App1        | w3wp.exe             | 2
App1        | <WebJob>.WebJob.exe  | 6
App3        | w3wp.exe             | 4
App3        | w3wp.exe             | 2
Total       |                      | 1920

With this data from MS you at least know where the problem lies and can go back and review the offending app.
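
If you get a similar table from support, a few lines of Python are enough to total it per app and spot the offender. This is only a sketch over the rows shown above; the limit is the Standard S1 per-instance figure from the earlier table.

```python
from collections import defaultdict

# (web app, process, open socket count) rows copied from the table above.
ROWS = [
    ("App2", "<WebJob>.WebJob.exe", 4),
    ("App2", "<WebJob>.WebJob.exe", 4),
    ("App2", "w3wp.exe", 2),
    ("App1", "<WebJob>.WebJob.exe", 8),
    ("App1", "<WebJob>.WebJob.exe", 4),
    ("App1", "<WebJob>.WebJob.exe", 6),
    ("App1", "w3wp.exe", 2),
    ("App1", "w3wp.exe", 1870),
    ("App1", "<WebJob>.WebJob.exe", 6),
    ("App1", "w3wp.exe", 2),
    ("App1", "<WebJob>.WebJob.exe", 6),
    ("App3", "w3wp.exe", 4),
    ("App3", "w3wp.exe", 2),
]
PLAN_LIMIT = 1920  # Standard S1, per instance

total = sum(count for _, _, count in ROWS)

# Sum the open sockets for each web app.
per_app = defaultdict(int)
for app, _, count in ROWS:
    per_app[app] += count

# The single process holding the most sockets.
worst = max(ROWS, key=lambda row: row[2])

print(f"total open sockets: {total} of {PLAN_LIMIT}")
for app, count in sorted(per_app.items(), key=lambda item: -item[1]):
    print(f"{app}: {count}")
print(f"worst process: {worst[0]} {worst[1]} with {worst[2]} sockets")
```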

To make sure this doesn't happen to your own Azure web apps, review the code that handles connections to external services. Some of the common external dependencies in the modern cloud world are:

  1. SQL - https://azure.microsoft.com/en-us/documentation/articles/sql-database-develop-dotnet-simple/
  2. Redis - https://azure.microsoft.com/en-us/documentation/articles/cache-dotnet-how-to-use-azure-redis-cache/
  3. Service Bus - https://azure.microsoft.com/en-us/documentation/articles/service-bus-performance-improvements/
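
Whatever the dependency, the usual fix is the same: create the client once and share it, instead of creating a new one per request. The Python sketch below shows only that pattern; `ExpensiveClient` is a hypothetical stand-in for a real SQL, Redis, or Service Bus client, and the linked articles describe the actual recommendations for each library.

```python
import threading


class ExpensiveClient:
    """Hypothetical stand-in for a client that opens its own sockets."""

    instances = 0

    def __init__(self):
        # In a real client, each new instance would open new outbound connections.
        ExpensiveClient.instances += 1


_client = None
_client_lock = threading.Lock()


def get_client():
    """Return one process-wide client, creating it on first use."""
    global _client
    if _client is None:
        with _client_lock:
            if _client is None:  # re-check now that we hold the lock
                _client = ExpensiveClient()
    return _client


def handle_request():
    # Reuse the shared client rather than calling ExpensiveClient() here.
    return get_client()


for _ in range(1000):
    handle_request()

print("clients created:", ExpensiveClient.instances)
```

Had `handle_request` constructed its own client, a thousand requests would have meant a thousand clients, which is roughly what the 1870-socket worker process above was doing.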

Thanks to the blog post at http://www.freekpaans.nl/2015/08/starving-outgoing-connections-on-windows-azure-web-sites/, which explains the same problem.

The fact remains, however, that no monitoring tool is available that tracks the open socket count, so you will never know the number of open socket connections for your App Service Plan unless you request it from Microsoft.

Troubleshooting automatic restart of Azure Web Jobs

Lately I was working on a production issue where the web jobs hosted on Azure were restarting by themselves or moving into a stopped state. There were no signs of a user restarting them manually (you can check that via the Activity Logs on your website). So why was it happening? The answer lies in the web job logs. To make it easier to understand, I have laid out most of the reasons for an automatic restart, with sample logs (you can get the logs from either the Kudu web job dashboard or directly from the storage account).

Reason 1: Due to website shutdown/restart

[10/24/2016 06:53:36 > b3e7a2: SYS INFO] WebJob is stopping due to website shutting down
[10/24/2016 06:53:36 > b3e7a2: SYS INFO] Status changed to Stopping
[10/24/2016 06:53:39 > b3e7a2: INFO] Job host stopped
[10/24/2016 06:53:41 > b3e7a2: ERR ] Thread was being aborted.
[10/24/2016 06:53:42 > b3e7a2: SYS INFO] WebJob process was aborted
[10/24/2016 06:53:42 > b3e7a2: SYS INFO] Status changed to Stopped
[10/24/2016 06:59:17 > 521cd3: SYS INFO] Status changed to Starting
[10/24/2016 06:59:20 > 521cd3: SYS INFO] Run script '<YourWebJobName>.WebJob.exe' with script host - 'WindowsScriptHost'
[10/24/2016 06:59:20 > 521cd3: SYS INFO] Status changed to Running

Reason 2: Due to changes in the Azure web job directory files or file content (D:\home\site\wwwroot\app_data\jobs\continuous\<WebJobName>)

[10/29/2016 00:00:43 > 521cd3: SYS INFO] Detected WebJob file/s were updated, refreshing WebJob
[10/29/2016 00:00:43 > 521cd3: SYS INFO] Status changed to Stopping
[10/29/2016 00:00:47 > 8d7eea: SYS INFO] Detected WebJob file/s were updated, refreshing WebJob
[10/29/2016 00:00:47 > 8d7eea: SYS INFO] Status changed to Stopping
[10/29/2016 00:00:48 > 521cd3: INFO] Job host stopped
[10/29/2016 00:00:49 > 521cd3: ERR ] Thread was being aborted.
[10/29/2016 00:00:51 > 521cd3: SYS INFO] Status changed to Stopped
[10/29/2016 00:00:51 > 521cd3: SYS INFO] Status changed to Starting
[10/29/2016 00:00:52 > 521cd3: SYS INFO] WebJob process was aborted
[10/29/2016 00:00:52 > 521cd3: SYS INFO] Status changed to Stopped
[10/29/2016 00:00:52 > 521cd3: SYS INFO] Job directory change detected: Job file 'ApplicationInsights.config' timestamp differs between source and working directories.
[10/29/2016 00:00:51 > 8d7eea: INFO] Job host stopped
[10/29/2016 00:00:52 > 8d7eea: ERR ] Thread was being aborted.
[10/29/2016 00:00:52 > 8d7eea: SYS INFO] WebJob process was aborted
[10/29/2016 00:00:52 > 8d7eea: SYS INFO] Status changed to Stopped
[10/29/2016 00:00:53 > 8d7eea: SYS INFO] Status changed to Starting
[10/29/2016 00:00:54 > 8d7eea: SYS INFO] Job directory change detected: Job file 'ApplicationInsights.config' timestamp differs between source and working directories.
[10/29/2016 00:01:01 > 521cd3: SYS INFO] Run script '<YourWebJobName>.WebJob.exe' with script host - 'WindowsScriptHost'
[10/29/2016 00:01:01 > 521cd3: SYS INFO] Status changed to Running

Reason 3: Due to a web app and/or web job deployment

This one is easy to understand, since a deployment changes the files in the web job directory and therefore triggers Reason 2.

Reason 4: Due to an Azure outage or maintenance

Reasons 1, 3 and 4 are straightforward to rule out using the standard Azure web app monitoring tools (like Failure History and others). Reason 2 still requires further analysis.

Basically, Reason 2 means that any change in the web job directory, i.e. a file or folder being added or removed (automatically or manually), restarts the web job. A change to the content of a file within the directory triggers a restart as well.
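
When the logs get long, it helps to pull out just the entries that explain a restart. The Python sketch below does that for the two messages seen in the sample logs above; the pattern is inferred from those samples only, so treat the log format as an assumption rather than a documented contract.

```python
import re

# One log entry: [MM/DD/YYYY HH:MM:SS > instance: LEVEL] message
# (format inferred from the sample logs; not an official specification).
ENTRY = re.compile(
    r"\[(?P<time>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) > (?P<instance>\w+): "
    r"(?P<level>[A-Z ]+?)\s*\] "
    r"(?P<message>.*?)(?=\s*\[\d{2}/\d{2}/\d{4} |\s*$)",
    re.DOTALL,
)

# Message prefixes taken from the sample logs, mapped to the reasons in this post.
REASONS = {
    "WebJob is stopping due to website shutting down": "Reason 1: website shutdown/restart",
    "Detected WebJob file/s were updated": "Reason 2: job directory change",
}


def restart_reasons(log_text):
    """Return (time, instance, reason) for each entry that explains a restart."""
    found = []
    for match in ENTRY.finditer(log_text):
        if match["level"] != "SYS INFO":
            continue
        for prefix, reason in REASONS.items():
            if match["message"].startswith(prefix):
                found.append((match["time"], match["instance"], reason))
    return found


def changed_files(log_text):
    """Return the file names reported in 'Job directory change detected' entries."""
    return sorted(set(re.findall(r"Job file '([^']+)' timestamp differs", log_text)))


SAMPLE = (
    "[10/24/2016 06:53:36 > b3e7a2: SYS INFO] WebJob is stopping due to website shutting down\n"
    "[10/24/2016 06:53:41 > b3e7a2: ERR ] Thread was being aborted.\n"
    "[10/29/2016 00:00:43 > 521cd3: SYS INFO] Detected WebJob file/s were updated, refreshing WebJob\n"
    "[10/29/2016 00:00:52 > 521cd3: SYS INFO] Job directory change detected: Job file "
    "'ApplicationInsights.config' timestamp differs between source and working directories.\n"
)

for entry in restart_reasons(SAMPLE):
    print(entry)
print(changed_files(SAMPLE))
```

The file names from the "Job directory change detected" entries are the most useful part, because they tell you what to chase next.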

In that case you need to dig further into what triggered the directory and/or file content changes within the web job directory. Ask yourself the questions below:

  1. Were there any changes to the web app settings via the Azure portal?
  2. Were there any runtime SDK upgrades, like upgrading App Insights or installing an extension?
  3. Are you creating any temporary files at runtime within the web job directory?

If the answer to question 3 is yes, you might want to review the application design.
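
One simple design change is to write scratch files to the operating system's temp location instead of the job's own directory. The Python sketch below illustrates the idea only; the function name and file prefix are made up for the example, and a .NET web job would use the equivalent temp-path API.

```python
import os
import tempfile


def write_scratch_file(data, job_dir):
    """Write temporary data outside job_dir so the job's own files never change."""
    # Use the OS temp location rather than anything under the job directory.
    fd, path = tempfile.mkstemp(prefix="webjob-", suffix=".tmp", dir=tempfile.gettempdir())
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)

    # Guard against a temp location that happens to sit inside the job directory.
    job_root = os.path.realpath(job_dir)
    if os.path.realpath(path).startswith(job_root + os.sep):
        os.remove(path)
        raise RuntimeError("scratch file would land inside the job directory")
    return path


# Usage with a throwaway directory standing in for the web job directory.
job_dir = tempfile.mkdtemp(prefix="fake-job-dir-")
scratch_path = write_scratch_file(b"intermediate results", job_dir)
print(scratch_path)
```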

This helped me fix my production issue. I hope it does the same for you.