Terraform and WSL2 issue

Here’s a quick note on an issue that I encountered today (as, it seems, have many other people).

I went to run a Terraform workflow on my system via WSL2, but I came across a number of problems.

The first was that I couldn’t obtain the state stored in an Azure Storage account container. Previously, I had used the following config:

backend "azurerm" {
    resource_group_name  = ""
    storage_account_name = ""
    container_name       = "terraform-backend"
    key                  = ""
}

At runtime, I would specify the values as in the example below.

export TF_CLI_ARGS_init="-backend-config=\"storage_account_name=${TERRAFORM_STATE_CONTAINER_NAME}\" -backend-config=\"resource_group_name=${RESOURCE_GROUP_NAME}\" -backend-config=\"access_key=${STG_KEY}\""
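With the referenced variables exported, terraform init picks the backend values up via TF_CLI_ARGS_init. A minimal end-to-end sketch (the values here are placeholders):

export RESOURCE_GROUP_NAME="rg-tfstate"
export TERRAFORM_STATE_CONTAINER_NAME="sttfstate001"
export STG_KEY="<storage-account-access-key>"
terraform init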

However, today that didn’t work; it just stalled trying to connect to the storage container.

I thought it was something wrong with my credentials, so for troubleshooting purposes I added the storage account key to see if that made a difference:

backend "azurerm" {
    resource_group_name  = ""
    storage_account_name = ""
    container_name       = "terraform-backend"
    key                  = ""
    access_key           = ""
}

I added the primary storage key and lo and behold, this time, it worked.
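For reference, one way to fetch the primary key is via the Azure CLI (a sketch, reusing the variable names from the init example above):

# fetch the primary key; TERRAFORM_STATE_CONTAINER_NAME holds the storage account name, as in the init example
STG_KEY=$(az storage account keys list \
  --resource-group "${RESOURCE_GROUP_NAME}" \
  --account-name "${TERRAFORM_STATE_CONTAINER_NAME}" \
  --query '[0].value' --output tsv)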

Strange, as I hadn’t updated the Terraform CLI or providers.

The next problem I saw was that when I tried to run

terraform plan

it would not complete, seemingly freezing. To troubleshoot this, I ran

export TF_LOG="TRACE"

before running the plan to tell me what was happening in the background.
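If the output is too noisy to follow in the terminal, Terraform can also write the trace to a file via the TF_LOG_PATH environment variable:

# optional: capture the trace to a file instead of the terminal
export TF_LOG_PATH="./terraform-trace.log"
terraform plan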

This in turn produces a verbose output, but something that did catch my eye was this:

Strange. I knew I had internet connectivity, and I could certainly connect to Azure using the az CLI, so I did some Google-fu and found the following: https://github.com/microsoft/WSL/issues/8022

It was exactly the same problem I had encountered.


Applying the fix https://github.com/microsoft/WSL/issues/5420#issuecomment-646479747 worked for me and persisted beyond a reboot.

(run the code below in your WSL2 instance)

# remove the resolv.conf that WSL auto-generates
sudo rm /etc/resolv.conf
# point DNS at a public resolver (Google's, in this case)
sudo bash -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf'
# tell WSL not to regenerate resolv.conf on restart
sudo bash -c 'echo "[network]" > /etc/wsl.conf'
sudo bash -c 'echo "generateResolvConf = false" >> /etc/wsl.conf'
# make resolv.conf immutable so the change survives reboots
sudo chattr +i /etc/resolv.conf
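After restarting WSL (wsl --shutdown from a Windows prompt), you can confirm that name resolution works again inside the instance; registry.terraform.io here is just an example host:

# should return an address if DNS is healthy
getent hosts registry.terraform.io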


The issue appears to have been introduced in the latest Windows update and affects WSL2. As far as I can tell, it only affects Go-based tools such as Terraform.

Hopefully this will help anyone having a similar issue until the underlying Go/WSL2 problem is fixed.



Fixing Azure Firewall Monitor Workbook

TL;DR: Here’s a version of the Azure Firewall Workbook that I fixed: https://github.com/dmc-tech/az-workbooks :)

For a client project, I had to deploy an Azure Firewall and wanted to ease the monitoring burden, so I deployed the Azure Monitor workbook as per the article here.

The article links to a Workbook that can be deployed to your Azure subscription. It’s a great resource, giving you plenty of insight into the activity taking place on the firewall via a Log Analytics Workspace configured as part of the resource’s diagnostic settings.

However, I did notice that some of the queries didn’t work as expected and produced some interesting results for the Application rule log statistics.

Below is an example:

If you check out the Action column, you can see that it has quite a lot of information, where I would expect to see ‘Allow’ or ‘Deny’.

I also noticed that some of the other panels did not return any results (such as the one above) when I expected to see data, so I dug a little deeper, having not had much experience of editing Workbooks.

First of all, I had to check the underlying query, so I had to go into ‘edit’ mode.

Once in edit mode, I selected one of the panels that was affected by the faulty query (anything concerning ‘Allow’ for the Application rule log), then clicked on the ‘Edit’ button.

We’re concerned with checking the logic and the parsing of the log, so that the Action is correctly represented and the Policy and Rule Collection fields are populated.

To help triage, I opened the query in the Logs view.

I’ve highlighted where the issues were. First, the logic was incorrect, so the wrong branch of the query matched, and the msg_s field was not parsed correctly. Second, the parse pattern missed out the ‘space’ before Policy and Rule Collection Group, so those values were captured incorrectly.
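To see the parsing in action, here’s a minimal illustration you can paste into the Logs view (the sample msg_s value is hypothetical, but follows the firewall’s Application rule log format):

print msg_s = "HTTPS request from 10.0.0.4:55555 to example.com:443. Action: Allow. Policy: fwpolicy01. Rule Collection Group: rcg01. Rule Collection: rc01. Rule: rule01"
| parse msg_s with Protocol " request from " SourceIP ":" SourcePort " to " FQDN ":" DestinationPort ". Action: " Action ". Policy: " Policy ". Rule Collection Group: " RuleCollectionGroup ". Rule Collection: " RuleCollection ". Rule: " Rule

With the literals matching exactly, each column captures just its value (Action is "Allow", Policy is "fwpolicy01", and so on). With the default parse mode, a literal that doesn’t match the message (such as a missing space) leaves the columns empty or shifts what they capture, which is exactly what the broken panels were showing.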

Here’s how the query should look:

Add and msg_s !has "Rule Collection Group" as indicated; remove the highlighted and msg_s !has "Rule Collection", and add the spaces as indicated so that the parse statement correctly attributes the values to each parameter.

You can see in the query results that the Allow entries no longer have the additional Policy:… text added.

Now that we’ve identified the issue, we need to update the Workbook.

Go back to the workbook and edit the query, putting the identified fixes in place.

Remember to click ‘Done Editing’ when you’re finished.

Here’s a snippet of the query:

(
materializedData
| where msg_s !has "Web Category:" and msg_s !has ". Url" and msg_s !has "TLS extension was missing" and msg_s !has "No rule matched" and msg_s !has "Rule Collection Group"
| parse msg_s with Protocol " request from " SourceIP ":" SourcePort " to " FQDN ":" DestinationPort ". Action: " Action ". Rule Collection: " RuleCollection ". Rule: " Rule
),
(
materializedData
| where msg_s !has "Web Category:" and msg_s !has ". Url" and msg_s !has " Reason: "
| where msg_s has "Rule Collection Group"
| parse msg_s with Protocol " request from " SourceIP ":" SourcePort " to " FQDN ":" DestinationPort ". Action: " Action ". Policy: " Policy ". Rule Collection Group: " RuleCollectionGroup ". Rule Collection: " RuleCollection ". Rule: " Rule
)

Great, we’ve fixed one panel; unfortunately, there are more. I’ve shown the process I used to fix the queries, so you can go on and find the other panels with the same issues and fix them yourself, or just go ahead and import the fixed version of the workbook that I uploaded :)

https://github.com/dmc-tech/az-workbooks

Configuring Azure Application Gateway for accessing Kibana

Here’s a quick post on how to configure Azure Application Gateway so that any instance of Kibana it protects will work.

Background:

I’m working with a private OpenShift cluster deployed to Azure (not ARO; it was deployed via IPI) that I want to publish publicly, protected by the App GW WAF.

Once the cluster had been deployed and published via the App Gateway, trying to access Kibana returned an internal 500 error. Accessing it directly from within the virtual network worked fine, so I knew it was definitely the App Gateway causing the issue.

Looking at the Kibana logs, I saw the following:

Although I obfuscated my public IP address, you’ll notice that the port is appended. Could this be the problem? (Of course the answer is yes; IP address:port isn’t a valid IP address!)

The error message kind of gives the clue:

"message": "Cannot resolve address x.x.x.x:50735: [security_exception] Cannot resolve address x.x.x.x:50735"

I needed to figure out how to rewrite the request header so it would work.

(I won’t talk about how I set up backend address pools, HTTP settings, frontend ports, listeners and probes, as that will be part of a future in-depth post, but I will describe the particular rewrite rule required so that Kibana works.)

X-Forwarded-For Rewrite Rule

The offending header is X-Forwarded-For. The Application Gateway adds this header, and it includes the client IP + port. Microsoft describe this here.

From the portal, open up your Application Gateway and open up Rewrites

Add a Rewrite set

Give your rule a name that is something meaningful, or just go with the defaults.

Click on the Click to configure this action link (1) and enter the settings below. Once configured, don’t forget to Update the rewrite set.
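For reference, the same rewrite can be sketched with the Azure CLI; the resource and rule names below are placeholders, and {var_add_x_forwarded_for_proxy} is the documented server variable holding the X-Forwarded-For client IP with the port stripped:

# create a rewrite set, then a rule that resets X-Forwarded-For without the port
az network application-gateway rewrite-rule set create \
  --resource-group myRg --gateway-name myAppGw --name kibana-rewrites

az network application-gateway rewrite-rule create \
  --resource-group myRg --gateway-name myAppGw \
  --rule-set-name kibana-rewrites --name strip-xff-port --sequence 100 \
  --request-headers "X-Forwarded-For={var_add_x_forwarded_for_proxy}"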

Once the rewrite rule was in place, Kibana opened as expected.

Azure Bastion undocumented requirement gotcha

Just a quick post to highlight an undocumented requirement for Azure Bastion that I came across when deploying a Landing Zone.

I’m creating a new landing zone for a client, and we’re using Azure Bastion for secure access to IaaS VMs. I decided to create the resource in a separate resource group from the Virtual Network, as it was uncertain whether it would be required long term. There’s nothing in the current documentation to indicate that this isn’t possible, so I tried to deploy.

After a few minutes, it failed:

Here’s the less than helpful error:

No matter what I tried (Portal, Terraform, Azure CLI), the same occurred.

Upon speaking to Azure Support, I learned that this is a known issue; the mitigation is to deploy the Bastion host within the same Resource Group as the Virtual Network that it is trying to connect to.

I’ve experienced the same when deploying API Management in Azure, but at least there the errors from ARM were meaningful and pointed me in the right direction.

Hopefully, if you come across the same error before the problem is resolved, this will help you out.

Deploying Azure Stack HCI 20H2 on PowerEdge R630

Ever since the new version of Azure Stack HCI was announced at Microsoft Inspire 2020, there has been a real buzz about the solution and where it is going. Thomas Maurer has written a great article detailing the technology improvements here. If you haven’t read it, I highly recommend you do!

If you want to get hands-on, there are a few options here to do so:

  • Deploy the preview to existing Hardware

  • Deploy on your existing Virtualization platform (use Nested Virtualization)

  • Deploy to a VM in Azure (using a VM Series that supports Nested Virtualization)

Matt McSpirit has written a fantastic step-by-step guide for deploying to virtualized platforms; again - highly recommend reading it.

You can also check out MSLab (https://github.com/microsoft/MSLab), which automates the deployment to Hyper-V environments.

However, I want to write about deploying to physical hardware that I’m lucky enough to have access to, which will be more representative of what real-world deployments look like. I won’t go into the full process, as the Microsoft documentation goes through all the steps, but I will explain how I installed it onto physical servers that aren’t officially supported.

Hardware

I have access to a number of Dell EMC PowerEdge R630 servers that were originally used to host the ASDK, so I know they meet the majority of the requirements to run Hyper-V / S2D. Dell EMC don’t officially support this server for HCI, but I’m only using it to kick the tires, so I’m not bothered about that.

Each Server has:

  • 2 x 12 core Xeon E5-2670 processors

  • 384 GB RAM

  • 5 x 480GB SSD drives

  • Perc H730 mini RAID controller

  • 4 x Emulex 10GB NICs

The only real issue with the list above is the H730 controller. As it is a RAID controller, it could cause issues, and Dell EMC recommend using an HBA330 controller, as per this thread. I don’t have any HBA330s to hand, so I had to make do and try to make it work (spoiler alert: I did get it working, details later!)

Preparing the Hardware

The first thing I had to do was to ensure that each server was configured correctly from a hardware perspective. To do this, connect to the iDRAC interface for each of the servers and make the following changes, if needed.

First, I had to make sure the PERC H730 controller was set to ‘HBA’ mode.

Navigate to Storage / Controllers

From the Setup tab, I checked the current value for the Controller Mode. It should be set to ‘HBA’

If it isn’t, from the corresponding Action dropdown, select HBA.

From the Apply Operation Mode dropdown, select At Next Reboot and then click on Apply

Next, I navigated to iDRAC Settings / Network and selected the OS to iDRAC Pass-Through tab. Make sure Pass-through configuration is Disabled, and apply the changes if necessary.

Reboot the server now for the changes to apply to the controller mode.

Next on the list was to ensure that the BIOS and firmware are up to date. I used the Server Update Utility from the Lifecycle Controller, as documented here, to do this.

Once all that was complete, I could start to deploy the Azure Stack HCI OS.

Installing Azure Stack HCI OS

There are two things that we need to download: the Azure Stack HCI ISO and Windows Admin Center (the latest version is 2103.2). To get the necessary files, you need to sign up for the public preview. Fill in your details here and go grab them. WAC can be downloaded here as well.

Once you have the ISO, go ahead and connect to your Virtual Console and connect the Virtual Media (the version downloaded via the Eval link is AzureStackHCI_17784.1408_EN-US.iso). On more than one occasion I forgot to click on Map Device, which wasted a few minutes!

To make things a little easier, go to the iDRAC, Server / Setup. Change the First Boot Device to Virtual CD/DVD/ISO and then apply the changes.

The next thing is to restart the server. I had to make sure I didn’t get distracted, as after a couple of minutes of running through the BIOS steps, it prompts you to press any key to boot from the CD/DVD. The image below shows it’s going to try to boot from the Virtual CD…

…and the following is if you don’t press a key in time :)

I won’t detail the installation process of the Azure Stack HCI OS, as there are very few configuration items required; just make sure you select the correct drive/partition to install the OS on (Drive 0 for me; I wiped all the volumes/partitions).

After a while (it can be slow doing an install via the Virtual Media), HCI OS will be installed. You’ll need to set an administrator password on first login.

Once that’s set, the config menu appears.

From here, the first thing was to confirm I had a valid IP address, so I selected option 8 and then the NIC I wanted to use for management. The only other thing I really had to do was set the computer name, so I did that, but I also joined the system to a domain, just so I knew that name resolution and networking were working as expected.
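If you’d rather script those last steps than use the menu, something like the following should do it from a PowerShell session on the node (the node and domain names are hypothetical):

# rename the node and join it to the domain in one step, then reboot
Add-Computer -DomainName "contoso.local" -NewName "hci-node-01" -Credential (Get-Credential) -Restart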

I rinsed/repeated for all the servers (4 of them) that I wanted to form my HCI cluster.

Follow the instructions here to deploy the cluster.

Fixing Storage Spaces Direct Deployment

When going through the cluster installation, I did encounter one error that blocked the deployment of S2D. This could be seen in the Failover Cluster Validation report.

List All Disks for Storage Spaces Direct Failed

Bus type is RAID - it should be SAS, SATA or NVMe

Fortunately, we can change this by running the following PowerShell command:

(Get-Cluster).S2DBusTypes = "0x100"
S2DBusTypes should report back as Decimal 256

Running the Cluster Validation again will now report success.

You can now go ahead and create the S2D cluster :)
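For completeness, here’s a minimal PowerShell finish, run from one of the cluster nodes (this assumes you’re enabling S2D directly rather than via the Windows Admin Center wizard):

# confirm the workaround stuck - should return 256 (0x100)
(Get-Cluster).S2DBusTypes

# enable Storage Spaces Direct across the cluster
Enable-ClusterStorageSpacesDirect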