In a recent consulting engagement, I’ve needed to perform a large-scale migration of a company’s virtual machine (VM) fleet from an On-premise datacenter to Microsoft Azure. Thinking about what that actually means – We’re picking up many compute workloads that are (in most cases) essential for day to day business operation and re-homing them to a new slice of a Microsoft-managed datacenter. After coming out the other end and completing the project, I thought I would shed some light on the tools that I used and developed to make the vision a reality.
In this particular engagement the customer is a large enterprise with a VMware environment servicing 300+ VM’s. When we consider the business value behind each of these compute workloads, it quickly becomes apparent that selecting the right tooling and approach is vital to deliver a successful outcome whilst causing as little disruption to the business as possible.
Enter Azure Site Recovery
Azure Site Recovery (ASR) is Microsoft’s Disaster Recovery as a Service solution which can replicate workloads running on physical and virtual machines from one location to a secondary location. As a disaster recovery platform, it’s possible for workloads to failover and successfully failback in a disaster scenario. ASR can also be used to migrate workloads to Azure by completing the failover component without failing back.
Why Should We Automate Azure Site Recovery?
I like to automate things like this because a computer following a process that someone writes will always perform it the same way. We can expect what the output will look like. In this case that means a like for like VM that looks and feels like it did in its previous life, before being migrated. When we introduce an operator into the mix we also introduce the human element. Things like resource names and groups, VM specs, disk settings, network location and ip addresses all need to be configured for each VM migration.
To have success running migrations at scale, it is important to use known, well-tested, repeatable processes. For me, that means figuring out the best way to use a tool then automating it so that you (or anyone else) can use it the right way, everytime, easily.
How Can We Automate Azure Site Recovery?
I use PowerShell as an automation tool on top of ASR for a couple of reasons. The main reason being that Microsoft provide and maintain a set of PowerShell modules for interacting with Azure resources, including ASR. This is known as the Az module – See our previous post on the Azure PowerShell Az module for a deeper explanation. PowerShell can also run almost anywhere thanks to PowerShell Core, a cross-platform edition of PowerShell that runs on Windows, macOS and Linux.
Armed with PowerShell and the Az module, we can get cracking on with the fun stuff – bashing out some lines of code. My approach and methodology here usually involve a fair bit of back and forth, playing with the commands that are available to me and learning the best ways to drive them. Importantly, you don’t want to do this with live data, setting up an isolated sandpit with dummy data will go a long way in allowing you to upskill knowledge around the tools while making sure your production systems remain untouched.
Once I’ve got a handle on the commands that are needed and how they fit together, I make a MVP (minimum viable product) script. The idea here is to demonstrate that its possible for the tooling to work (it’s not pretty but it works). To paint a picture, one of my MVP scripts will usually have a bunch of variables at the start declaring all the info that is required, things like VM name, source location, target location etc. From here, I usually design the script to be ran line by line, this is mostly for simplicity sake, complexity can come later, right now it just needs to be as simple as possible. At this stage, we can demonstrate our capability to perform a migration with PowerShell. A quick example of this is setting up a replication job, preceding this line, I do a series of get statements to build up all the variables seen in the command bellow.
$replicationJob = New-AzRecoveryServicesAsrReplicationProtectedItem -VMwareToAzure -ProtectableItem $vm -Name (New-Guid).Guid -ProtectionContainerMapping $replicationPolicy -ProcessServer $ProcessServer -Account $Account -RecoveryResourceGroupId $ResourceGroup.ResourceId -logStorageAccountId $LogStorageAccount.Id -RecoveryAzureNetworkId $vnet.Id -RecoveryAzureSubnetName $failoverSubnetName
From here, I like to put some lipstick on it and make it feel like a more polished product. Personally, I like to use a series of questions and prompts to generate the variables I described in the last paragraph. I also add status checks and operator prompts to continue. An example of this could be when performing a failover, once the operator confirms he is ready to begin, the command executes the failover, then continuously checks the failover job status until it has completed, once completed, tell the user running the script that its complete. Here is an example of a status check that I wrote for checking the progress of a failover job.
Write-Host "======== Monitoring Failover ========"
Write-Host "This will refresh every 5 seconds."
$failoverJob = Get-ASRJob -Name $failoverJob.Name
Write-Host -ForegroundColor Red "ERROR - Unable to get status of Failover job"
Write-Host -ForegroundColor Red "ERROR - " + $_
log "ERROR" "Unable to get status of Failover job"
log "ERROR" $_
Write-Host "Failover status for $($VMName.FriendlyName) is $($failoverJob.State)"
} while (($failoverJob.State -eq "InProgress") -or ($failoverJob.State -eq "NotStarted"))
Once you get this far, the sky is the limit. Like most things, it can evolve over time. I like to add error handling and logging so we can elegantly handle a failure and have an audit trail of operations. I take this approach with most of the processes I automate, I think it’s important to start small and work up from there.