Tuesday, September 1, 2015

Automatic Shutdown of Idle Machines with vRealize Operations and vRealize Automation

(Credit to my colleagues for this work: James Polizzi, Scott Stickells)

We were recently engaged with a customer who had an issue with over-consumption of their development platform.  This is probably familiar to many of us, and the customer was interested in how the vRealize Suite could help solve it.

As we know, vRealize Automation helps to attach ownership to resources, identify an appropriate lease time for non-permanent machines, and help with the entire lifecycle of virtual machines.  vRealize Operations helps even further by identifying inefficiencies in the environment – in particular, those virtual machines that are chewing up resources inappropriately, whether over-sized, idle and unused, or perhaps just hogging storage.

In our customer’s situation, they not only wanted to control the lifecycle of virtual machines, but also ensure that ONLY needed virtual machines were running on the platform.  That meant figuring out which virtual machines were idle and unused, and shutting them down immediately – all without human intervention, review or overhead.  To help address this, we looked at two main elements of extending the vRealize Suite:

  1. A policy in vRealize Operations that could be associated with the development environment, identify idle machines (with the appropriate policy to define what “idle” meant), and AUTOMATICALLY take an action to call vRealize Orchestrator and shut down the idle machines.
  2. A workflow in vRealize Orchestrator that would take the parameters from vRealize Operations, find the machines under vRealize Automation, invoke the clean and controlled “Shutdown” action, and notify the associated machine owner of the action that had been taken.


vRealize Operations

(Just as a foreword, one of the key features soon to be released in vRealize Operations 6.1 is the ability to fully automate “Smart Alerts” without ANY user intervention.  So for the purpose of this blog, keep in mind that we are using the vRealize Operations 6.1 BETA release.)



To achieve the required outcomes there were a few elements that we configured in vRealize Operations. 


Firstly, we installed a customised version of the vRealize Orchestrator adapter available here.  When I say customised, what we did was add an additional vRealize Operations action called “VM: Auto Power” that references the vRealize Orchestrator workflow we created (detailed below).  In theory, this could reference any workflow you require. While we will see this process simplified in future releases, today it requires assistance from VMware Professional Services – or some detailed know-how of Solution Pack design!





Secondly, we created a “Custom policy” which we will assign to a Custom group of objects. This allows us to selectively enforce our automated alert and define exactly what we consider to be an “Idle VM”. 







Next we created a “Custom group” of objects (VMs) that we wanted to target with this automated remediation action. We have the ability to filter on a number of different aspects.  One of the more popular ways is to create a dynamic group using a vSphere tag, or perhaps using folder structures in vCenter.  In our case, to keep testing simple, we statically assigned particular workloads to this group using “Objects to always include”.  After this group is created, it needs to be referenced in the custom policy specified above.  Take care with the order in which you perform these actions, to ensure you don’t accidentally start to automatically shut down idle machines across all vRA-managed workloads!
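To make the difference between static and dynamic membership concrete, here is a minimal Python sketch.  It is not vRealize Operations code: the `group_members` function, the VM names and the “idle-shutdown” tag are all hypothetical, and it only illustrates that a group is the union of the objects you always include and the objects matched by a filter such as a tag.

```python
def group_members(vms, always_include=(), required_tag=None):
    """Combine statically included VMs with VMs matched dynamically by tag.

    `vms` maps a (hypothetical) VM name to the set of tags on that VM.
    """
    members = set(always_include)
    if required_tag is not None:
        members.update(name for name, tags in vms.items() if required_tag in tags)
    return members


# Made-up inventory for illustration only.
inventory = {
    "dev-vm-01": {"idle-shutdown"},
    "dev-vm-02": set(),
    "prod-vm-01": set(),
}

# Static membership, as we used for testing.
static_group = group_members(inventory, always_include=["dev-vm-02"])
# Dynamic membership driven by a tag.
tagged_group = group_members(inventory, required_tag="idle-shutdown")

print(sorted(static_group))
print(sorted(tagged_group))
```

Either way, anything outside the group (here “prod-vm-01”) is never a candidate for the automated action, which is why the group needs to be right before the policy is enabled.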




After this we configured a new “Smart Alert” with the condition “VM is Idle = 1” and added the action “VM: Auto Power”, to trigger our Orchestrator workflow when the workloads are detected as being idle.



Finally, we activated the alert - by “editing” our recently created policy, finding the alert in question (in section 5. Override Alert/Symptoms Definitions) and selecting “enable” for both “state” and “automate”. In other words, we enable this alert locally in the policy and enable the automate action when the given conditions are true!

Note: the “Automate” option is disabled on all alerts by default - which is definitely a good thing considering the power of this functionality!
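The way these settings combine can be sketched in a few lines of Python.  This is only an illustration of the decision logic described above, not how vRealize Operations is implemented: the `PolicyAlertSettings` class, the `should_auto_shutdown` function and the VM names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyAlertSettings:
    """Hypothetical model of the per-policy alert overrides described above."""
    group_members: set = field(default_factory=set)  # VMs in the custom group
    state_enabled: bool = False     # "State" override for the alert
    automate_enabled: bool = False  # "Automate" override (off by default)


def should_auto_shutdown(vm_name, is_idle, settings):
    """Return True only when every condition for the automated action holds."""
    return (
        vm_name in settings.group_members
        and is_idle
        and settings.state_enabled
        and settings.automate_enabled
    )


dev_policy = PolicyAlertSettings(
    group_members={"dev-vm-01", "dev-vm-02"},
    state_enabled=True,
    automate_enabled=True,
)

print(should_auto_shutdown("dev-vm-01", True, dev_policy))   # idle and in the group
print(should_auto_shutdown("prod-vm-01", True, dev_policy))  # idle but not in the group
```

The point of the sketch is that the action only fires when all four conditions are true, so leaving “Automate” at its default keeps the alert informational.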



Once this is all complete, whenever a workload that is a member of the given group/policy is detected as idle, the Shutdown workflow will be initiated automatically and an email notification sent to the owner of the workload!


vRealize Orchestrator

In this workflow, which you can download here, we still need to manually configure a few items.  This is done by invoking the “Configuration” workflow first, which makes it a little easier to draw out the static environment inputs needed by the workflow, such as your email server, port, etc.




The workflow is invoked by vRealize Operations with a virtual machine identifier, an event correlation identifier (for logs and auditing), and the unique ID for the vCenter Server (in case your vRealize Operations is monitoring multiple vCenters).  Virtual machines have quite a few different types of identifiers; in this case, vRealize Operations passes the Managed Object ID, which has the form “vm-9110”.  This identifier is not guaranteed to be unique, especially across multiple vCenters.  More importantly, it isn’t the one tracked by vRealize Automation.  The workflow, therefore, needs to use the given identifier to find the “instanceUUID” of the VM from vCenter, which is also a key attribute (“vmUniqueId”) in vRealize Automation.
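As a rough illustration of why the vCenter ID has to travel with the Managed Object ID, here is a small Python sketch.  It is not the Orchestrator workflow itself: the `VCENTER_INVENTORY` dictionary, the `resolve_instance_uuid` function, the vCenter names and the UUID values are all made up for the example.

```python
# Hypothetical inventory: (vCenter ID, Managed Object ID) -> instanceUUID.
# All values are invented for illustration.
VCENTER_INVENTORY = {
    ("vcenter-a", "vm-9110"): "5002a1b3-0000-0000-0000-00000000000a",
    ("vcenter-b", "vm-9110"): "5002c4d5-0000-0000-0000-00000000000b",
}


def resolve_instance_uuid(vcenter_id, moref, inventory=VCENTER_INVENTORY):
    """Map a vCenter-scoped Managed Object ID to the VM's instanceUUID."""
    try:
        return inventory[(vcenter_id, moref)]
    except KeyError:
        raise LookupError(
            f"No VM {moref!r} found in vCenter {vcenter_id!r}"
        ) from None


# The same Managed Object ID resolves to different VMs in different vCenters.
print(resolve_instance_uuid("vcenter-a", "vm-9110"))
print(resolve_instance_uuid("vcenter-b", "vm-9110"))
```

The resulting instanceUUID is then the value to match against “vmUniqueId” when looking the machine up in vRealize Automation.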

Next, the workflow extracts the matching VM object from vRealize Automation, along with the email address of the object’s owner.  It uses this later, with the built-in mail workflows, to notify the owner by email that the idle machine has been shut down.  In vRealize Automation, the Shutdown action is not something that can be called by name, but rather has its own UUID which has to be found.  In the workflow, we enumerate all the available actions for the identified virtual machine, and find the one named “Shutdown”.  There is a chance that this action won’t be found, especially if the virtual machine is already off!  This could also happen if the virtual machine is in the middle of some other state (being reconfigured), or if the Shutdown action isn’t exposed by the blueprint owner.
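The “enumerate, match by name, and cope with a missing action” step might look something like the following Python sketch.  It is a simplified stand-in rather than the real workflow or the vRealize Automation API: the `find_action_id` and `shutdown_or_skip` functions, the dictionary shape of each action and the IDs are all assumptions made for illustration.

```python
def find_action_id(actions, name="Shutdown"):
    """Return the ID of the action with the given name, or None if absent."""
    for action in actions:
        if action["name"] == name:
            return action["id"]
    return None


def shutdown_or_skip(vm_name, actions):
    """Describe what would be done for a VM given its available actions."""
    action_id = find_action_id(actions)
    if action_id is None:
        # Already powered off, mid-reconfigure, or action not exposed.
        return f"skipped {vm_name}: Shutdown action not available"
    return f"requested action {action_id} for {vm_name}"


# Made-up action lists for a running VM and a powered-off VM.
running_vm_actions = [
    {"name": "Reboot", "id": "action-0001"},
    {"name": "Shutdown", "id": "action-0002"},
]
powered_off_vm_actions = [
    {"name": "Power On", "id": "action-0003"},
]

print(shutdown_or_skip("dev-vm-01", running_vm_actions))
print(shutdown_or_skip("dev-vm-02", powered_off_vm_actions))
```

Treating a missing action as a logged skip rather than an error keeps the alert from failing noisily on machines that are already off.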

As you can see from the screenshots below, this workflow has a relatively small number of steps (if you ignore the helpful logging parts!), and just works as designed.



In the email notifications, we found a way to reconstruct the URL directly to the virtual machine in vRealize Automation, which helps the VM owner to quickly go and power their machine straight back up again if needed.  This leads me on to the conclusions for this article…




Conclusions

While this workflow does exactly what was desired – shutting down idle virtual machines – it doesn’t have the whole picture covered yet.  From a technical perspective, should the workflow look to “Power off” the machine if Shutdown doesn’t work?  How should it handle multi-machine blueprints (we didn’t test those)?

A really helpful consequence of invoking the Shutdown action through the vRealize Automation layer is that any existing Approval steps or further custom extensions will still be invoked, just as if the user had tried to manually shut down their virtual machine.  This opens up the potential for the system to be controlled and reviewed through a business process, using the standard vRealize Automation features.

This opens up a line of questions around whether ANY machines should be subject to automatic shutdown without human intervention and review.  The answer for some environments will be an emphatic “NO”, because the impacts won’t be well understood straight away.  Other environments already do this, however, because the private/public cloud use cases are built within a certain operational culture from the outset.  So, before you take this work and throw it into your own environment, ask some questions about what virtual machine populations this would be suitable for, what would happen if idle machines magically became unavailable, and whether this is really addressing your core problems.  You might find confirmation that this is indeed the right way to go, but you might also find that the basic features of vRealize Automation allow enough accountability, lease control and cost visibility to address most of the wastage inside your infrastructure.

Happy automating!