Monday, September 9, 2013

Performance definitions that should exist

When it comes to Cloud Computing, understanding service and performance definitions is a key hurdle, and one that is usually new to organisations looking to make a move to a Private Cloud, or perhaps considering adoption of a Cloud Provider.

When Cloud Computing was first being talked about, and particularly Private Cloud situations, I recall VMware sales teams (including me) illustrating service tier distinctions based on such things as uptime.  For example, tier 1 workloads get vSphere HA, and not tiers 2 or 3.  But looking back on this, what was the real operational cost of having vSphere HA enabled?  Certainly nothing to do with human effort.  Maybe some extra capital cost for some redundant hardware, but that should exist in all vSphere environments anyway.  I haven't seen any vSphere implementation that intentionally builds in NO redundant Ethernet/FC links, or host failover capacity.  It just isn't sensible to do this for the mere sake of creating a lesser tier of infrastructure.

The next key candidate for definitions used with tiers of service was performance, which is the topic of this blog article.  This implied an expensive, high-performance tier versus a cheaper, lower-performance tier.

The example representation of these performance tiers was storage types.  That is, tier 1 applications would be put on fast disks, maybe some nice 15K RPM Fibre Channel drives, and tier 2 applications would be put on relatively slower, cheaper disks such as some 7200 RPM SATA drives.  OK, that sounds somewhat justifiable.  But what happens if the fast disk is hopelessly over-committed (because in the absence of an effective cost model, everyone selected the "best" levels!) and the tier 1 applications get terrible performance?  Meanwhile, the "cheap" SATA drives might be under-used, or lucky enough to host non-demanding applications, and perform the same as the tier 1 disks.  Whoops.  Something is clearly missing from these common examples of performance definitions.

Here's my proposal: define performance in terms of guarantees of response from the infrastructure.  Specifically, what I mean is that when an application needs to perform disk I/O, it gets a certain response time.  When it needs access to CPU or memory, it does or doesn't have to wait/contend for these resources.  Putting this into "VMware engineer" speak, or my version of it, the performance definition might contain terms such as the list below.  Obviously, the actual numbers here are fictitious and would need to reflect a real SLA.

CPU  


  • Maximum/peak speed guaranteed = 2.2GHz.  This would be max vCPU speed, matched or bettered by the underlying hardware or CPU Limit imposed by vSphere.
  • Reserved speed guarantee = 10% of configured.  This would be the CPU Reservation imposed by vSphere, and in theory is the minimum kept available for the VM as a whole, across all vCPUs.
  • CPU Ready time guaranteed to be below 100 milliseconds, as an average for each 20 second period.  This one is the real trick, in my view.  You can guarantee all the CPU Reservation you like, but if CPU Ready is holding high at some crazy value like 3000ms, the application will suck.  This can be a tricky one to back up, because there are many factors that contribute to CPU Ready time, not least being the number of multi-vCPU workloads created by the tenant/user that are competing for co-scheduled CPU time, and perhaps out of the hands of the infrastructure provider.  How this service level is maintained would be a combination of carefully monitoring the over-commitment of vCPUs, and how many 2, 4, 8 or more vCPU VMs are supported in a cluster.  It would be prudent to prohibit certain sizes of VM for a given service, but perhaps this is truly just something that needs monitoring and reporting by the Operations team.  (A rough calculation sketch follows this list.)
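
To make the CPU Ready part of the definition concrete, here is a minimal sketch in plain Javascript (the same scripting language vCO uses, though this is not tied to any particular tool or API) that converts a ready-time figure for a 20 second sample into a percentage and flags a breach of the fictitious 100 millisecond ceiling.  The names and thresholds are mine, for illustration only.

    // Minimal sketch: evaluate a CPU Ready figure against the (fictitious) 100 ms ceiling.
    // vSphere realtime statistics report ready time in milliseconds per 20 second sample.
    var CPU_READY_LIMIT_MS = 100;
    var SAMPLE_INTERVAL_MS = 20000;

    function checkCpuReady(readyMs) {
        var readyPercent = (readyMs / SAMPLE_INTERVAL_MS) * 100;
        return {
            readyMs: readyMs,
            readyPercent: readyPercent,
            breached: readyMs > CPU_READY_LIMIT_MS
        };
    }

    // A "crazy" sample of 3000 ms works out to 15% ready time - the application will suck.
    var sample = checkCpuReady(3000);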

Memory


  • Reserved memory guarantee = 50% of configured.  Ensures that at least this much physical memory is present for the VM.  This one is reasonably uncomplicated, and imposed by vSphere for the VM.
  • Hypervisor maximum swap rate = 400 pages/second.  By this, I mean the disk swapping of VM memory handled by the hypervisor (ESXi), not the hypervisor's own memory, and not the disk swapping performed by the VM's guest operating system.  This might be worth further consideration, but it is a way to ensure that even though not all memory is guaranteed, the platform will have capacity to support the workloads as a whole without grinding to a halt.  My motivation for including this comes from extrapolating what happens when the Memory Reservation is pushed to its defined limits.  Let's say that memory is indeed reserved for each workload at 50%, but the service provider only provisions exactly enough memory to meet that definition.  In other words, the rest of the memory space for VMs requires ESXi swapping to disk.  Now, it should be clear that the performance will be pitiful, and all tenants/users will rise up in revolt - but against what?  The provider has met their guarantees!  They never said your application was guaranteed to perform well, and how could they for your complicated application?  This is a harsh situation if the infrastructure provider is internal to the business and the tenant can't just jump ship.  The tenant WILL find a way to go elsewhere - so it must be fixed.  Anyhow, expressing a maximum swap rate is one way to provide a guarantee that, while the unreserved memory may be under contention across the workloads, it will only be up to a point.  This is enforced by the Operations team making the right choices for memory over-commitment on the platform, and monitoring it.  An early warning sign would be when Memory Ballooning activity starts to rise - ESXi is asking the VM guest OS for help!  (A small sketch of what the reservation guarantee means in physical memory follows this list.)
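
To show what the memory reservation guarantee actually commits the provider to, here is a tiny sketch (plain Javascript, illustrative names and numbers only, not any real API) that works out how much physical RAM must exist behind a set of VMs at a 50% reservation, and how much of the configured memory is only as good as the over-commitment policy.

    // Illustrative only: the physical memory footprint implied by a 50% reservation guarantee.
    var RESERVED_FRACTION = 0.5;

    function reservationFootprint(vmConfiguredGB) {
        var configured = 0;
        for (var i = 0; i < vmConfiguredGB.length; i++) {
            configured += vmConfiguredGB[i];
        }
        var reserved = configured * RESERVED_FRACTION;
        return {
            configuredGB: configured,          // what the tenants asked for
            guaranteedGB: reserved,            // must be backed by physical RAM
            atRiskGB: configured - reserved    // relies on sensible over-commitment, not swapping
        };
    }

    // Four VMs of 8, 16, 16 and 32 GB: 72 GB configured, of which 36 GB must be physical.
    var footprint = reservationFootprint([8, 16, 16, 32]);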

Storage performance


  • Maximum disk operations per second = 400 per VM.  This is to ensure that there is a limit to noisy neighbours, and just good practice to ensure consistency of service.  Otherwise, early adopter customers will be disappointed when their initial blindingly-fast speed is diminished to "regular" levels as more customer workloads come on board.  vSphere Storage I/O Control can step in here.
  • Minimum guaranteed disk operations per second = 25 per VM.  This is just like CPU guarantees - this ensures that a VM will be at least able to hobble along with a certain amount of throughput.
  • Maximum storage latency guaranteed to be below 80 milliseconds, as an average over each 20 second period.  Again, this one for me is the real trick.  I have come across a few terrible environments over the years (storage vendors to remain unnamed!) which struggled desperately with their VMware environment because of storage latency.  Storage throughput is one thing, but if that throughput is happening behind a veil of sluggish response, the application will suck.  Think of it as akin to your Internet speeds.  You might have an amazing 150Mb/s sitting at your home office (I wish), but terrible response time (pings).  Great for watching buffered videos, but if you're playing Call of Duty you're going to get hammered.  OK, perhaps not the most business-relevant example.  How about this - if your atom-smasher application is producing gigabytes of output data for storage, high throughput is fine.  But if your database on an accompanying VM is doing lots of small operations that are each slowed down by poor disk latency, the database will suck, regardless of the throughput available.  (A small averaging sketch follows this list.)
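
And the latency guarantee, expressed the same way: a minimal sketch (again plain Javascript with made-up numbers) that averages per-sample device latency across a 20 second reporting period and tests it against the fictitious 80 millisecond ceiling.

    // Minimal sketch: average observed latency over a reporting period against the 80 ms ceiling.
    var LATENCY_LIMIT_MS = 80;

    function checkLatency(latencySamplesMs) {
        var sum = 0;
        for (var i = 0; i < latencySamplesMs.length; i++) {
            sum += latencySamplesMs[i];
        }
        var averageMs = sum / latencySamplesMs.length;
        return { averageMs: averageMs, breached: averageMs > LATENCY_LIMIT_MS };
    }

    // Plenty of throughput, but these samples average 95 ms - the database will suck.
    var period = checkLatency([120, 85, 90, 110, 70]);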

Storage space


  • All configured storage will be available.  This might be needed to ensure that the tenant is not left high and dry because the service provider forgot to manage their thin provisioning in time!
  • This is also where broader service levels would be described around data backup and/or replication, data retention, and data recovery times.  In addition, if the context is an external Cloud Provider, you would also define data confidentiality, data erasure on VM deletion and/or service termination (a critical one, but usually absent), and data access by third parties.  It is worth clearly defining such things as legal jurisdiction, as the Australian Federal Police (and similar bodies in all countries) ultimately still have the power to confiscate data in justified circumstances.  Department of Homeland Security and Anti-Terrorism laws didn't really change anything here, sorry guys.

Please keep in mind that this is all in support of defining differentiated performance levels.  The point of defining any of the above numbers is to create a performance expectation (or even guarantee), sure, but also to enable an infrastructure provider to express the precise difference between what is meant by "Gold" and "Silver" services, for instance.  Thus, one set of performance numbers would be appropriate for the better service level, and a different set of numbers might support the lesser service level, as in the sketch below.
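
To illustrate, here is one way the whole thing could be captured as data rather than prose.  The "Gold" numbers are the fictitious ones from the lists above; the "Silver" numbers are equally fictitious, just there to show the shape of a lesser tier.

    // Fictitious tier definitions - the point is only that the difference between
    // Gold and Silver becomes a precise, comparable set of numbers.
    var performanceTiers = {
        gold: {
            cpu:     { maxGHz: 2.2, reservedPercent: 10, maxReadyMs: 100 },
            memory:  { reservedPercent: 50, maxSwapPagesPerSec: 400 },
            storage: { maxIops: 400, minIops: 25, maxLatencyMs: 80 }
        },
        silver: {
            cpu:     { maxGHz: 2.2, reservedPercent: 5, maxReadyMs: 500 },
            memory:  { reservedPercent: 25, maxSwapPagesPerSec: 1000 },
            storage: { maxIops: 200, minIops: 10, maxLatencyMs: 150 }
        }
    };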

And don't forget, this is not necessarily about external service providers - all this should apply equally well to a business's own internal IT practice.  In fact, the business should first have some understanding of these numbers in order to understand what an external provider is offering, and how suitable the offering would be for the business.  How else can the business know whether it is going to win or lose by changing to or between service providers?  Price?  Bah.  The last thing I need is a bunch of unscrupulous cheap cloud providers creating a poor reputation for "going to the cloud".

I need to point out that some of these performance definitions can be controlled using technical platform features, and some can't - or might be difficult.  vSphere has some fantastic capabilities in all these areas, and these should be used to create a well understood performance tier.  Some of these capabilities are exposed to higher-level products such as vCloud Director and vCloud Automation Center - but certainly not all.  This is where a clever IT practice will need to determine which can be integrated as part of the service, and perhaps even orchestrated using vCenter Orchestrator or similar mechanisms.  Other aspects of these guarantees will not be implemented as technical controls - CPU Ready time being one example.  It is just going to require a switched-on Operations team that is aware of what they are managing, why they are managing those numbers, how they gain visibility of these things, and what to do when numbers are breached or under threat of a breach.

That's my quick list of some considerations for performance definitions that I think are glaring in their absence from service tier discussions.  Perhaps they exist, and I would love to hear from you if I have overlooked you or your favourite service provider.  In addition, I have only come up with my own thoughts on some very important numbers and considerations - but I am certainly no VCDX.  I am sure I have missed a bunch of additional metrics that are just as important, or perhaps supersede my suggestions.  Or perhaps, you disagree with what I have proposed above.  In any case, please raise your voice - either in comments below, or ping me a note!

Sunday, August 18, 2013

Charging your cloud customer costs they want to see

In an earlier blog, I described a simple way to start building a cost model based on infrastructure cost recovery.  This approach tried to account for costs actually incurred, so that you could have a good idea of who is using your infrastructure budget, and maybe have a basis for charging them.  Obviously, this is what all service providers need to understand, but it is also something that many IT practices are starting to tackle.

The problem with that approach, however, is that if you passed the resulting costs onto a customer based on the "actual cost" idea, you would be spending a lot of time arguing with your customer.  The customer (whether internal business or an external customer) would query how the hell they ended up with a bill of $5,127.63 this month, when last month it was $3,502.24.   This is a good lesson I learned from talking to service providers.  Customers are very gun-shy of the "shocking mobile phone bill" event.  Anyone paying for an IT service is VERY keen for it to be predictable, consistent from month to month, and easily understood.

So, how to marry these two ideas?  On the one hand, the provider side of the service has definite costs, and those costs need to be accurately tracked and apportioned out to the users of that service.  On the other hand, the customer doesn't want to end up with highly fluctuating costs or unpredictable bills, and NEVER wants to receive some huge bill they can't explain.  The "inexplicable bill" is a problem even when a customer is willing to deal with varying bills.  What I mean is, when a bill comes and the customer queries it, how would the provider explain how those particular numbers got calculated?  Do you imagine a conversation where the customer is interested in how many CPU cycles were consumed, bytes transferred, memory consumed on a physical server, and so on?  I can imagine a shouting match full of accusations, doubts, lack of trust, and ultimately an unhappy experience.

The insight from a few of my service provider friends was "the simple cost model".  I guess it's nothing magic - but it makes good sense as part of a "dual cost model".  One for internal validation of real costs incurred.  The other for something you can confidently invoice, explain and back up with simple data.

Take a cursory look around a few IaaS provider price lists, and you'll find some easy examples.  For instance, $35/month for a virtual machine of a standard size.  Other costs might include certain specific resources added to the VM configuration, like another vCPU or disk, or a larger memory size.  I think the key to making this work simply is to limit the sizing choices available to the customer, so that prices increment in only a small number of known ways.  Then it is easy to look at a list of customer VMs, and without the aid of a spreadsheet or detailed metrics, you could quickly work out the cost of that setup.  Thus, a customer knows what they're up for, can understand an invoice when it arrives, AND can have a conversation with the provider about it without it turning into a frustrated screaming match!  (A tiny sketch of this idea follows.)
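
As a sketch of the idea, with invented prices and a deliberately tiny set of options:

    // Invented prices - the point is that a customer can reproduce their own bill
    // from a short list of VMs without a spreadsheet.
    var priceList = {
        baseVmPerMonth: 35,
        extraVcpuPerMonth: 10,
        extraDiskPer100GBPerMonth: 8,
        extraMemoryPerGBPerMonth: 5
    };

    function monthlyCharge(vm) {
        return priceList.baseVmPerMonth
            + (vm.extraVcpus || 0) * priceList.extraVcpuPerMonth
            + (vm.extraDisks100GB || 0) * priceList.extraDiskPer100GBPerMonth
            + (vm.extraMemoryGB || 0) * priceList.extraMemoryPerGBPerMonth;
    }

    // A standard VM plus one vCPU and 4 GB extra memory: 35 + 10 + 20 = $65/month.
    var charge = monthlyCharge({ extraVcpus: 1, extraMemoryGB: 4 });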

Now, I did mention before how to marry this "outward cost" with the accurate internal cost.  This is a reconciliation exercise that should be done on a regular basis, to ensure that whatever the customers are being charged is as close as possible to the real costs incurred (while maintaining whatever profit margins go with it).  You might find that the "incurred cost + margin" and the "invoiced revenue" vary by 10% either way from month to month - but as long as the two numbers gravitate near each other over time, you are winning!  And as variances become clear, the customer model might need some slight adjustment.  Why adjust only the customer model?  Because the internal model is based on real costs and should be the more accurate number, while the customer model is intentionally simplified and a bit artificial.  (A simple reconciliation sketch is below.)
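
The reconciliation itself can stay just as simple.  A sketch, using the 10% figure mentioned above (the margin and dollar amounts are placeholders):

    // Compare "incurred cost + margin" with "invoiced revenue" for the month.
    function reconcile(incurredCost, marginPercent, invoicedRevenue) {
        var targetRevenue = incurredCost * (1 + marginPercent / 100);
        var variancePercent = ((invoicedRevenue - targetRevenue) / targetRevenue) * 100;
        return {
            targetRevenue: targetRevenue,
            variancePercent: variancePercent,
            adjustCustomerModel: Math.abs(variancePercent) > 10   // drifting too far - tweak the simple model
        };
    }

    // Incurred $40,000 with a 15% margin, invoiced $42,000: about 8.7% under target - acceptable for now.
    var monthly = reconcile(40000, 15, 42000);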

Well, hopefully those thoughts make sense.  One cost for keeping internal, and a simpler derived cost for charging on to your customers/users.  This approach should make sense regardless of whether you are providing services to your own business users, or to commercial customers.

Happy modeling, and as always please throw any comments back my way!

Tuesday, June 11, 2013

The Ever-Versatile vCenter Orchestrator


Solving a Simple Backup Problem 

vCenter Orchestrator is a bit of a dark horse in the VMware product portfolio.  Almost every customer has it, because it is licensed along with every vCenter Server, yet almost every customer has never touched it.  It is actually a VERY powerful tool to have in your toolbox.
Of course, I’m a strong opponent of using orchestration-centric approaches to building a self-service and/or cloud environment.  The two-part problem is that
  1. Orchestration designs, “great” as they might be, need to touch many pieces of technology in the datacenter.  While no single element of integration may be a particular challenge, this up-front implementation of many moving parts is costly and time-consuming – and such solutions will typically take 6 months or more, and cost several hundreds of thousands of dollars in services alone.  
  2. Orchestration designs are hypersensitive to any technology changes over time. That is, you are likely to break the intricate machine whenever you perform a software upgrade, firmware upgrade or hardware model change. That is usually guaranteed to happen yearly (or half-yearly) for software, and every few years for hardware. Multiplied by 10-15 moving parts, or more, that means the solution is never actually stable for any length of time – or else it holds the business back from making necessary changes.
Well, having said all that, orchestration has its place. If it were on the food pyramid, it might be “fats and oils”. Rich in energy, necessary as part of a complete diet, but you have to take it easy or else it’ll lead to a heart attack, or perhaps an inability to leave your front door!  OK, so it’s not the best analogy… but hopefully the point sticks.
Last year, a customer was seeking a way to achieve a simple backup method, to safeguard their remote offices from VMs breaking through software changes, updates or “tinkering”.  Each site had a standalone ESXi host managed by a central vCenter Server, local storage, and fair-to-poor WAN links.  We decided to explore using vCenter Orchestrator to create application-consistent, on-site, self-managing backups.  What we ended up with looked pretty useful, so I thought I would share it here.   It also took Peter Marfatia and me only a few days to put together, which I thought was pretty reasonable for a team with limited skills in the tool.

I have included a link to the resulting package at the bottom of this article, and also the automatically generated documentation that vCenter Orchestrator provided for me.

To get started with vCenter Orchestrator, there are some great resources out there – some I’ve included at the bottom of this article.  It is a great learning experience to install the Orchestrator Appliance and Client, and just look around at the various actions, workflows and tools that it makes available.

The Overall Backup Process

Perhaps to start with, a view of the overall process we used for this Branch Backup would help.  A rough sketch of the same flow in script form follows the list.
  1. The workflow is pointed to a folder within vCenter
  2. It discovers all virtual machines within that folder, and determines whether they are candidates for backup, or are instances of prior backups.
  3. If requiring backup, it performs a snapshot with quiescing, then clones the snapshot to a new VM, which is converted to a template (to prevent accidental power-ons).
  4. If the VM is a prior backup, it removes those no longer needed.
  5. When all backups have been processed, the workflow emails a report to a nominated address.
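
And here is that flow as a rough, plain-Javascript outline.  The helper functions are hypothetical stand-ins for the vCO workflows and actions described in the rest of this article, not real API calls.

    // Hypothetical stand-ins for the real workflows, just to show the dispatcher's shape.
    function isPriorBackup(vmName)              { return vmName.indexOf(":BACKUP:") !== -1; }
    function pruneIfExpired(vm, backupsToKeep)  { return "retention checked for " + vm.name; }
    function backupThisVm(vm)                   { return "backup submitted for " + vm.name; }
    function sendReport(address, results)       { /* email the per-VM results to the nominated address */ }

    function runBranchBackup(vmsInFolder, backupsToKeep, reportAddress) {
        var results = [];
        for (var i = 0; i < vmsInFolder.length; i++) {
            var vm = vmsInFolder[i];
            if (isPriorBackup(vm.name)) {
                results.push(pruneIfExpired(vm, backupsToKeep));   // step 4
            } else {
                results.push(backupThisVm(vm));                    // step 3
            }
        }
        sendReport(reportAddress, results);                        // step 5
    }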

Backup Dispatcher

This is the main entry point into the whole workflow. 


You can see that in vCenter Orchestrator, there is a visual layout of the workflow steps, like all other orchestration tools.  Even if this is the first time you’ve seen orchestrator, you can look at the diagrams and have a fair understanding of what is happening when the workflow is run.

When running the job manually, the workflow asks a small number of questions as shown below.  This is the default user interface presented by the Orchestrator Client, and others are available – check out the VMware Labs site for some options.

Invoking the “Backup Dispatcher” workflow, this interface is asking for the following elements.
  • Email address to send the job report – listing the VMs backed up and the success/failure results.
  • Number of backups to retain – on a per VM basis.  This could be made a global property, but we had fun playing around with different levels here.  I wrote the retention logic to allow for changes to retention, so that during periods of greater change, more backups could be safely kept, and then scaled down later.
  • Folder containing the VMs to be backed up – the workflow would collect ALL VMs from the selected folder.  The management of what to backup is then a simple drag’n’drop of any VMs into or out of this folder.
We provide a few things in the static properties, such as mail server and content settings, but most other things are dynamic.  The static properties are present as “Workflow Attributes” – these are essentially working (read/write) variables that don’t act as workflow input (read only) or output (write only).  Before running the package, you will need to have your vCenter Server registered in vCenter Orchestrator, so that it appears within the vCO inventory and enables communication between the two.

Below is a screenshot of the inventory, as viewed from the vCO Client.  This is one of the excellent aspects of vCO – I can pre-configure what things are present in my environment and not have to deal with connection strings, user credentials, and various properties embedded in scripts.  It is done just once, and then the workflows can talk to your datacenter!  If you can’t see your vCenter inventory like this in vCenter Orchestrator, the workflow won’t be able to do much with your environment!

In my case, I have easily connected up:
  • 2 x vCenter servers
  • UCS Manager
  • Active Directory
  • vCloud Director
  • vCenter Chargeback
  • A mail server

When clicking on the hyperlink to point the workflow to a specific folder (initially has the value “Not set”), the workflow interface presents a view of the vCenter inventory (as seen by vCO) for you to choose from, as in the screenshot below.  Again, this is simpler for the workflow user, because I had already made vCenter available to vCO using a service account (although I could have forced a per-user connection if I wanted).

When scheduling the workflow to run on an automatic daily cycle, you can set this parameter during the scheduling process, in the same way as shown here.  In the case of our customer, they wanted to schedule a collection of these backup jobs daily, each pointing to slightly different sets of VMs to backup.  In the vCO Scheduler, each job entry was provided with the distinct folders that it would manage, and the jobs just ran thereafter without any real caretaking.

The first action of the Backup Dispatcher is to “Get All Virtual Machines By Folder Including Sub Folders”.  This is pretty self-explanatory, and was an action already available through the vSphere Plugin shipped with vCO.  Conveniently, this requires only the input folder, and returns an array containing all VMs found.

The next action is to “Sort VMs By Name”.  I implemented this as a subordinate workflow, while I was toying around with various ways to solve a key problem – which was how to determine whether the current backup is to be retained or not.  I wanted to ensure the workflow didn’t use any hard dates – as it doesn’t know if it is being run weekly or daily, or ad hoc, and a whole bunch of other “if’s” and “maybe’s” that came up while I was thinking about it.  Due to the limited amount of time I wanted to spend on it, and my rudimentary skills, I decided to name VM backups according to a certain naming pattern, which contains:
  • Original VM name
  • A known delimiter, which hopefully won’t pop up in a normal VM name.  I chose a colon “:” in my example, after checking with the customer that they wouldn’t expect a problem.
  • The keyword “BACKUP”
  • Another delimiter “:”
  • A date/time stamp, in the format yyyyMMddhhmm, such as 201306051609 – which is what my clock says as I write this.
This “Sort VMs By Name” action merely contains a Javascript function, made with a little help from web searching, which sorts VMs according to original name first, and then from most recent to least recent backup.  This helps later on, because the retention policy can skip over the ones to be retained, and delete any subsequent backups older than these.  You’ll see that later in the “Process backup and retention logic” workflow.  Anyway, the point being that this particular part of the effort probably took the longest, as I struggled to remember anything at all about writing some script.  It just goes to show how little I needed to know for the rest of the effort!  A sketch of the kind of comparator involved is below.
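
For what it’s worth, here is a sketch of the kind of comparator that does the job, built around the naming convention above.  The actual function in the package differs in detail (and in polish).

    // Sort by original VM name, and within each name put the newest backup first.
    var DELIMITER = ":";

    function splitBackupName(fullName) {
        var parts = fullName.split(DELIMITER);
        var isBackup = parts.length === 3 && parts[1] === "BACKUP";
        return {
            original: parts[0],
            stamp: isBackup ? parts[2] : ""    // yyyyMMddhhmm, or empty for a 'real' VM
        };
    }

    function compareVmNames(nameA, nameB) {
        var a = splitBackupName(nameA);
        var b = splitBackupName(nameB);
        if (a.original !== b.original) {
            return a.original < b.original ? -1 : 1;     // group by original VM name
        }
        return b.stamp.localeCompare(a.stamp);           // newest timestamp first within the group
    }

    // vms is an array of VM objects with a .name property:
    // vms.sort(function (x, y) { return compareVmNames(x.name, y.name); });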

The next stage of the main workflow is essentially a “for each” loop.  I implemented it as an explicit loop, just because the easy vCO “ForEach” logic control worked a little differently than I wanted it to here – but I could probably tackle this again in a better way.  For each of the VMs in the (now sorted) array, I submit them each to the “Process backup and retention logic” subordinate workflow.  This subordinate workflow will determine whether or not to backup the VM at hand, and if so, will return an identifier for the backup activity.

Once all VMs have been “processed”, the main workflow then waits for any backup activities that are being performed, using the workflow identifiers kept from each job submission, and then sends a report.  The main framework being used here was derived from some previous examples created by Joerg Lew. (Well, I think it was Joerg!)

OK, so that’s the main flow, but the cool bit is doing the live backup from a quiesced snapshot, so let’s look at that - in a minute.  First, we need to figure out what needs backing up.

Process Backup and Retention Logic

In hindsight, this is a crappy name, but this subordinate workflow is being called for each and every VM that was discovered, and is trying to determine whether this is a ‘real’ virtual machine needing backup, or if it’s a backup that might need to be removed.  This part of the workflow was where I did most of the thinking, trying different approaches that I could make work using my rudimentary skills.



You can also see the passing of inputs and outputs for this workflow, which is an awesome visualisation of where information is flowing.  It's also very easy to just click'n'drag this info around, as you're building the workflow.



Firstly, the workflow tries to separate the discovered VM name into its three separate elements, according to the “OriginalVMname:BACKUP:201306051409” type of format.  If this is not actually a backup, the last two elements will just be empty, of course.  If we find that the current VM has a new name, then any counters – which were being used to count the number of old backups – need to be reset to zero.  Then, if the VM has the special name “BACKUP” in it, the workflow only needs to determine whether to keep it.  (A rough sketch of this decision is below.)
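
The gist of that decision, reduced to a plain Javascript sketch (the variable names are illustrative, not the attribute names used in the actual package):

    // Walk the (sorted) list one VM at a time: reset the counter on a new original name,
    // keep the newest backups up to the retention count, and flag anything older for deletion.
    var currentOriginal = "";
    var backupsSeen = 0;

    function decideForVm(fullName, backupsToKeep) {
        var parts = fullName.split(":");
        var isBackup = parts.length === 3 && parts[1] === "BACKUP";

        if (parts[0] !== currentOriginal) {
            currentOriginal = parts[0];     // a new original VM name - reset the backup counter
            backupsSeen = 0;
        }
        if (!isBackup) {
            return "BACKUP";                // a 'real' VM - submit it for backup
        }
        backupsSeen++;
        return (backupsSeen <= backupsToKeep) ? "RETAIN" : "DELETE";
    }

    // With retention set to 2, the third (oldest) backup of a VM comes back as "DELETE".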

These decisions were based on a couple of simple bits of Javascript logic – but you may notice that all the decisions are being made with vCO logic elements.  This is also another easy part of vCenter Orchestrator – you drop in an “IF” logic box, give it an input to determine a “true” or “false” choice, and then you simply drag a connection for each choice to the next part of the workflow.  Too easy.

The only smart bits I used in this whole workflow were:

  • A few Javascript bits of logic, which I could likely replace with more readable vCO logic elements
  • A call to a vCenter action to “Delete Virtual Machine”, if the workflow has found an old backup requiring deletion.
  • A function call to submit a new vCO workflow for any VMs found that need backing up.  This is done in the “Backup This VM” script action, and this is what returns the identifier for the running backup job that is tracked later on.  The workflow that we actually call is “Clone VM For Backup”, which is described next.

Clone VM For Backup

This is a pretty straightforward bit of work, which anybody could put together on their Day 2 exploration of vCenter Orchestrator.  It simply takes a VM as an input, and calls the workflows already available in the vSphere Plugin.


There is one element that is a little ‘special’ here, which is the “Clone From Snapshot” workflow.  After creating a snapshot, which is passed the “Quiesce=True” parameter, the next piece is to clone the still-running VM from that snapshot.  This workflow was taken from Joerg Lew’s years-old blog on this topic.  This is a native capability of vSphere, and the vSphere API, but just isn’t readily exposed through other means such as the vSphere Client.

This workflow is also passed the parameter to make the new clone into a template, which helps avoid accidental power-on operations.  The new template is named according to the “OriginalVMname:BACKUP:yyyyMMddhhmm” format mentioned earlier (a small naming sketch is below).
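
Building that name is nothing more than a bit of date formatting.  A plain Javascript sketch of the idea (the package’s own code may differ slightly):

    // Produce "OriginalVMname:BACKUP:yyyyMMddhhmm" for the new template.
    function pad(n) { return n < 10 ? "0" + n : "" + n; }

    function backupName(originalName, when) {
        var stamp = when.getFullYear()
            + pad(when.getMonth() + 1)
            + pad(when.getDate())
            + pad(when.getHours())
            + pad(when.getMinutes());
        return originalName + ":BACKUP:" + stamp;
    }

    // e.g. backupName("Moodle", new Date()) might return "Moodle:BACKUP:201306051609"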

The quiescing behaviour called during the snapshot action is the native vSphere capability to invoke VSS for Windows machines, or look for scripting stubs in Linux machines.  It is then up to an application owner to determine if any special actions might be needed to ensure application consistency.  Whatever the owner decides, the workflow doesn’t need to worry about it.

The other “cool” thing I decided would make sense is to “Remove All Snapshots” once the clone has finished.  This was a clear decision that I thought would have the added benefit of ensuring snapshots disappeared on a regular basis.  I have fielded enough urgent calls from customers who have killed their environment because of snapshots filling up datastores. If this was deemed undesirable, however, the workflow could be modified to only remove the snapshot that was created in the prior step, using the available “Remove Snapshot” workflow instead.  The risk here is that something unexpected might prevent this clean up from happening one day, and the snapshot would be effectively “forgotten”.  Hence, my decision to remove all snapshots provides a nice safeguard.

Results

At the end of the process, I can look at the results of running this workflow in the environment.  Below is a view of the job mid-flight in a demo environment.

You can just see that the “Moodle” VM is still being cloned and hasn’t yet been turned into a template.  You can also see the templates from earlier backups that have already finished – both from the current schedule and from an earlier run.

I put a certain amount of logging into the vCO workflows, writing to the very handy Server.log() function call, such as in the example below.
            Server.log("Submitting backup for: " + vm.name);

I have included the logging output below, to give you an idea of what this creates.

This solution took only a few days of playing, experimenting and learning.  A lot of the vCenter Orchestrator functionality is self-evident, or if not, it is very comprehensively documented.  The customer was very pleased with this simple approach to solving a simple problem, and we trod a fine line between simplicity and complexity, to ensure the customer could easily understand the results and own it without too much hassle.

I am certainly not suggesting that this is a great backup strategy for your organisation, and that isn’t really the point of sharing it here.  I have used this example of a quick and cheap solution to illuminate one way we have used vCenter Orchestrator.  There are many other use cases that I’m sure you will find, once you discover how excellent this tool is, that you probably already possess.

As I pointed out at the start of this article, technical architects can get carried away with orchestration, and many organisations build very complex systems using this approach.  The temptation is certainly there.  However, it is very sensitive to change.  The simple example here might be robust enough, because it is only talking to one element – vSphere.  But this would quickly become an unmanageable beast if we connected to a server platform, a storage platform, a network manager and a firewall system – just for example.  Each element either becomes frozen in time, or else creates a risk of breaking the orchestration workflows.

The abstraction delivered by virtualization solutions such as vCloud Director, and vCenter itself, introduces standardised, software-based interfaces to the datacenter.  Actions can then be controlled through these software interfaces by the tools’ native functions and policies.  This is the true value of the broader Software Defined Datacenter architecture.  For the large, complex enterprise, orchestration is still useful and necessary “glue” from time to time, and vCenter Orchestrator is a very powerful and friendly tool in this capacity.

Further Areas For Expansion

Thanks to Joerg Lew and Peter Marfatia for their contributions to putting this little solution together.  I also greatly appreciate the community leadership provided by Christophe Decanini and Burke Azbill, who contributed plenty of knowledge and examples on the web for me to follow.

The example given here is a quick run at a solution, and certainly has plenty of opportunities for improvement.  With additional time, I would probably replace some Javascript functions with vCenter Orchestrator logic elements, which would make the workflow easier to understand visually, and make the self-documentation more complete.  I would also re-visit the explicit loop I have used here, and find an elegant way to make use of the “ForEach” construct instead.

Resources

There are a bunch of resources that I have used over time, and that really help with getting an introduction to vCenter Orchestrator.  A couple of them are listed below, to start you on your way.


I have also uploaded my vCenter Orchestrator package and documentation at the links below.  Please feel free to use and abuse - and if you make it bigger and better, please share!

Thanks for reading!

Monday, May 13, 2013

Showing Back IT Costs to the Business - A Simple Cost Model for IaaS


Introduction

Let’s get it clear up front – I’ve only ever found two customers who actually have the ability (or interest) to “chargeback” IT costs to their business groups.  But, speaking in my role as a VMware techo, almost all customers are very interested to know how they can communicate IT costs to the rest of the business.  The problem is that this is not really something tackled by most IT groups, and it can be difficult to start, so I thought I’d share some thoughts, which have been helpful in my discussions so far.

The two questions most organisations seem to want answered are: "Where did the money go?"  And: "Who’s using all our stuff?"  This becomes so much more of a problem when establishing a shared virtual platform, and especially when talking about the path to a Private Cloud (or self-service) shared capability.

Perhaps you spent $100k on extra storage, and now it’s almost disappeared – but where?  Who has it?  Or perhaps you want to show real data to highlight the challenge that Project A is using up a ton of your IT capacity, and that is why you need to buy some more servers, or memory upgrades.  The reason for knowing this is that IT is usually begging for money, and the business can begrudge spending extra when it is hard to show why.

I love this quote from “The New CIO Leader”:
“Treated as overhead and often lacking even the most rudimentary chargeback systems, IT appears to be free and therefore without much value.  Lacking a price point to encourage responsible decision making, business leaders demand more service than they fund, creating a perpetual backlog of demand and placing IS in the position of continually saying no.”
The New CIO Leader
Dr Marianne Broadbent & Dr Ellen Kitzis
Harvard Business Press, 2004

In other words, IT keeps costing more money.  It looks like a burden, even if it is critical to the business.  And on the other hand, IT kind of appears to be “Free” (like beer), so why can’t I ask for 16 CPUs and 64GB memory in my new SQL database? After all, we now have a shared, pooled platform (thanks VMware!), so there is always a little spare capacity, and no reason I can’t have some, right?  Sound familiar?

What Does IT Cost?

It can be hard to draw the line on what you include under your cost models.  I think it is useful to start with technical resources – such as servers, licenses, storage and that sort of thing.  So other than staff, make a list of EVERYTHING that is involved in delivering a business application, and then we’ll find a way to allocate these costs to a metric, such as a VM hourly charge.  Having said that, don’t go crazy and turn this into an exercise for the Accounting Department just yet – keep it simple so you can get started.  Then you can refine.

CAVEAT:  I have used Hourly costs throughout my example here, because I am building a model that can be easily plugged into vCenter Chargeback Manager’s hourly consumption metering.  If you would rather use a monthly/daily basis, then adjust accordingly.

A Simple/Starter Example

Step 1 – Find your costs

A simple list to start with might look like the following.  
  • Physical servers
  • Network/storage switches
  • Firewalls and other networking/security devices
  • Storage controllers and disks
  • Software licenses
  • Rack space, power and cooling 

Step 2 – When does it cost?

This is important – you need to know what makes you incur these costs.  Let’s take physical servers as a good example to start with.  The core question to ask is “What event or resource causes me to buy another server?”  For most customers, the answer is memory consumption.  When I look across recent deployments, CPU capacity seems to be in plentiful supply, and memory is still the most constrained resource.  So in most cases, a project asking for 16GB memory is going to tip me over into a new server purchase, rather than the 4, 8, or 16 vCPUs that came with the request.

Your environment might be different, but the process will be similar.

Perhaps a physical server costs $20,000, including hardware maintenance over its planned life.  And perhaps this server has 96GB memory installed in it.  So on a resource basis:    
  • Cost of server: $20,000
  • Memory capacity: 96GB  (the most constrained resource)
  • Upfront cost per GB: $20,000/96GB = $208.33 per GB.  

That’s an upfront purchase cost – but some VMs live for 4 months, others for 8 years, so how does this figure?  The answer is to have a time-based rate, such as an Hourly cost, applied against the life of the equipment.

Step 3 – How long does it last?

This is often missed – when will you buy a replacement for this component?  I spoke to a customer last year who charged projects per GB of storage, but only once, ever.  That meant that IT was responsible for the entire cost of a SAN refresh – not a pretty overhead, and again IT just looked like a dead weight every few years.

Going back to our physical server example, perhaps you plan to refresh your servers every 3 years.  That means the costs should be spread over that 3 year life, as below.  Don’t forget leap-years!
  • Cost of server: $20,000
  • Memory capacity: 96GB  (the most constrained resource)
  • Life of server: 3 years (1,095.75 days, or 26,298 hours!)
  • Hourly cost per GB: $20,000 / 96GB / 3 years = $0.007922/GB/hour.  
Fantastic! Now there’s an hourly rate for the cost of a physical server to support a VM using a memory consumption model!  At least, it’s something to start with.  How accurate is it?  Well, let’s work on that.

Step 4 – Refining the model

The reality is that cost models can get as complicated as you can tolerate.  Again, don’t go nuts, as usually the main goal here is to find out a relative measure of who is using what amount of resources, and a rough cost impact.  But let’s consider a few more things in our simple model.

In a VMware vSphere environment, there will be clusters of hosts, and very often in an “N+1” availability configuration.  That is, the resource consumption should go only so high as still being able to tolerate one host being down without impacting performance.  Also, we tend not to absolutely fill a server before considering it “full”.  Lastly, VMware resource management is pretty awesome, allowing safe over-commitment, and we can eke out much more resources than just the raw consumption or allocation we assume up-front.

These considerations so far boil down to a net impact on the actual memory capacity of a server.  Instead of 96GB being "available", we modify it as below (the whole calculation is collected in a small sketch after the list).
  • Raw physical server capacity: 96GB
  • Clustering: 10 hosts (9 + 1, for High Availability)
  • Level of “fullness”: 85%
  • Degree of over-commitment: 20%
  • Adjusted memory capacity: 96GB * (9/10) * 85% * 120% = 88GB
  • Adjusted hourly cost per GB: $20,000/88GB/3 years = $0.008642/GB/hour. 
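
Pulling Steps 2 to 4 together, here is the same arithmetic as a small Javascript function, so the assumptions can be tweaked without redoing the sums by hand.  The inputs reproduce the worked example above.

    // Hourly cost per GB of VM memory, adjusted for clustering, fullness and over-commitment.
    function hourlyCostPerGB(serverCost, rawMemoryGB, lifeYears, opts) {
        var hoursOfLife = lifeYears * 365.25 * 24;                  // don't forget leap-years
        var adjustedGB = rawMemoryGB
            * ((opts.clusterHosts - 1) / opts.clusterHosts)         // N+1 availability
            * opts.fullness                                         // how "full" is full
            * (1 + opts.overCommit);                                // safe over-commitment
        return serverCost / adjustedGB / hoursOfLife;
    }

    // $20,000 server, 96GB, 3 year life, 10 hosts (9+1), 85% full, 20% over-commit
    // gives roughly $0.0086 per GB per hour - the adjusted figure above.
    var rate = hourlyCostPerGB(20000, 96, 3, { clusterHosts: 10, fullness: 0.85, overCommit: 0.20 });
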
I raised an important consideration above, which is "consumption or allocation".  To some degree this depends a little on whether you are trying to account for used resources (consumption) or requested resources (allocation).  You will need to make some assumptions in either case.  But for internal cost reporting (as opposed to chargeback) I find consumption to be a more accurate representation of "where stuff goes".


Step 5 – Expanding to other costs

So, now do you just rinse and repeat for all the other costs?  Well, actually not for that many.  A good number of the “other” costs will also roll into the cost of a physical server.  How so?  I’m glad you asked – let’s look at some examples below.
  • VMware licensing for vSphere or vCloud Suite
    • Per physical CPU, usually 2 or 4 CPUs/licenses per server.
    • The “life span” of a perpetual software license is tricky to account for.  You may elect to recover the cost of the license over 3 years, even though you won’t be re-buying the license when you refresh the server.  Annual Support and Subscription (or Software Assurance for Windows) has no such complications.
  • Microsoft Windows Server licensing
    • If you’re smart, you’ll be using Datacenter Edition licensing, which is also per physical CPU
  • Blade chassis
    • Will be an added cost to spread over the blade servers that it supports, which may be 8 servers perhaps.
    • However, a chassis may survive 6 years, or 2 server refreshes, so we need to take that into account.
  • Network switches
    • Purchases usually tied to ports being populated, rather than bandwidth.  Again, we can equate this to a per server charge to some degree.
    • Let’s assume 48 ports.  Perhaps 6 ports per server (a false example if using blade servers, of course).  That breaks down to the cost of one switch per 8 servers, for example.
    • The same theory could apply to Fibre Channel switches for storage, if FC based.
    • Switches will probably be refreshed on a different cycle, as per the blade chassis above, so again needs to be considered.
  • Rack space, power and cooling
    • Again, this can be tied to physical equipment, although it will be a rough estimate.
If I add these all together using some rough numbers out of thin air, I get to $0.023167 per GB of virtual machine memory per hour.  It is interesting that this is nearly 3 times the raw server cost, but this single rate is now able to account for servers, platform licensing, switches and datacentre costs.  (A sketch of this roll-up, with placeholder numbers, is below.)
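
The roll-up itself is just more of the same division.  A sketch with placeholder dollar figures (these are NOT the numbers behind $0.023167 – substitute your own):

    // Spread every per-server cost over the same adjusted memory capacity and equipment life.
    function blendedHourlyCostPerGB(adjustedGB, lifeYears, perServerCosts) {
        var totalPerServer = 0;
        for (var item in perServerCosts) {
            totalPerServer += perServerCosts[item];
        }
        return totalPerServer / adjustedGB / (lifeYears * 365.25 * 24);
    }

    // Placeholder figures only, following the list above.
    var blendedRate = blendedHourlyCostPerGB(88, 3, {
        serverHardware: 20000,
        vSphereLicensing: 7000,      // per-CPU licences plus support, recovered over the period
        windowsDatacenter: 5000,
        bladeChassisShare: 2500,     // chassis cost / 8 blades, adjusted for its longer life
        switchPortShare: 1500,       // one 48-port switch per 8 servers, roughly
        rackPowerCooling: 3000
    });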

I can likewise apply this type of approach to the remaining components from our short list above.  Storage per GB is easy enough, but perhaps the reason for buying another shelf of disk is actually I/O performance – some thought is needed to account for IOPS versus space.  You can probably generalize across the environment as a whole to determine the most obvious metric to use, and treat exceptional workloads as just that – exceptions. 

Keeping It Simple

Firewalls and security services are interesting examples of complexity.  The tipping point for expanding a firewall/IDS to the next bigger model, or another node, is probably network throughput or perhaps the number of rules processed.  You can either meter the network I/O of VMs to account for these, or perhaps assume an average of 6 rules per VM and meter on that basis. 

By the time we’re digging around at this level, however, we’re really tweaking a much more detailed model.  The overall accuracy of the model is only going to change in small increments.  There needs to be a decision about how far down this path you progress.  The overall goal is to get a rough idea of cost allocation.  Accurately accounting for absolutely all costs is an exercise for another article, as this would incorporate non-technical costs such as staff labour, facilities, and maintenance activities, for example.

The model above has boiled down to a few simple rates, which you can plug in to vCenter Chargeback Manager, as per the below screenshot.  You now have a method of finding out where your costs have gone, after buying all your new shiny servers and disks, and who in your organization is consuming how much infrastructure.


Driving Behaviour

One of the strong motivators to introduce cost reporting to a business is to help the business understand the consequences of their IT requests.  The model above merely accounts for resources as they are likely to be consumed - again going back to the question of consumption versus allocation.  A savvy business user might notice that CPU allocation and usage are not charged at all.  So we come back to the earlier question posed – why shouldn’t Project A ask for 16 CPUs for all their virtual machines?

On a technical level, we know that uncontrolled over-allocation of vCPUs will start to cause poor performance, from CPU Ready Time particularly.  While the example model I’ve given here is probably enough for cost recovery, there is reason to include a metric that encourages good behaviour – such as splitting costs between memory consumption and vCPU allocation (a small sketch of this split follows).  This will help the IT Department have a fact-based conversation with Project A about reducing its current allocation of CPUs, or asking for less in the first place.  Exactly how you split the cost, or whether it is perhaps an overlay/additional cost, is up to each organization to decide.  This ties into a future discussion comparing Cost Reporting, as we’ve done here, with Charging Back costs to the business groups.  That is a different topic, as we must consider “cost perception” as well as cost allocation.
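
As a sketch of what such a split might look like (both rates are invented, and the split ratio is purely a policy choice):

    // Charge mostly on memory consumed, with a smaller rate on vCPUs allocated
    // so that over-asking for CPUs is no longer "free".
    var memoryRatePerGBHour = 0.018;
    var vcpuRatePerHour     = 0.005;

    function hourlyCharge(vm) {
        return vm.memoryConsumedGB * memoryRatePerGBHour
             + vm.vcpuAllocated * vcpuRatePerHour;
    }

    // Project A's 16-vCPU, 8GB VM now costs noticeably more per hour than a sensible 4-vCPU one.
    var bigAsk   = hourlyCharge({ memoryConsumedGB: 8, vcpuAllocated: 16 });   // 0.224
    var sensible = hourlyCharge({ memoryConsumedGB: 8, vcpuAllocated: 4 });    // 0.164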

Further Areas For Expansion

This article is long enough already.  So some areas that are left to future write-ups could include:
  • Cost Reporting versus Charging Back to business groups
  • Accounting for the entire IT budget, beyond the technical resources
Have any other thoughts?  Or perhaps you disagree with my approach here?  Please let me know and write a comment below!