Monday, May 13, 2013

Showing Back IT Costs to the Business - A Simple Cost Model for IaaS


Introduction

Let’s get it clear up front – I’ve only ever found two customers who actually have the ability (or interest) to “chargeback” IT costs to their business groups.  But, speaking in my role as a VMware techo, almost all customers are very interested to know how they can communicate IT costs to the rest of the business.  The problem is that this is not really something tackled by most IT groups, and it can be difficult to start, so I thought I’d share some thoughts, which have been helpful in my discussions so far.

The two questions most organisations seem to want answered are: "Where did the money go?"  And: "Who’s using all our stuff?"  This becomes so much more of a problem when establishing a shared virtual platform, and especially when talking about the path to a Private Cloud (or self-service) shared capability.

Perhaps you spent $100k on extra storage, and now it’s almost disappeared – but where?  Who has it?  Or perhaps you want to show real data to highlight the challenge that Project A is using up a ton of your IT capacity, and that is why you need to buy some more servers, or memory upgrades.  The reason for knowing this is that IT is usually begging for money, and the business can begrudge spending extra when it is hard to show why.

I love this quote from “The New CIO Leader”:
“Treated as overhead and often lacking even the most rudimentary chargeback systems, IT appears to be free and therefore without much value.  Lacking a price point to encourage responsible decision making, business leaders demand more service than they fund, creating a perpetual backlog of demand and placing IS in the position of continually saying no.”
The New CIO Leader
Dr Marianne Broadbent & Dr Ellen Kitzis
Harvard Business Press, 2004

In other words, IT keeps costing more money.  It looks like a burden, even if it is critical to the business.  And on the other hand, IT kind of appears to be “Free” (like beer), so why can’t I ask for 16 CPUs and 64GB memory in my new SQL database? After all, we now have a shared, pooled platform (thanks VMware!), so there is always a little spare capacity, and no reason I can’t have some, right?  Sound familiar?

What Does IT Cost?

It can be hard to draw the line on what you include under your cost models.  I think it is useful to start with technical resources – such as servers, licenses, storage and that sort of thing.  So other than staff, make a list of EVERYTHING that is involved in delivering a business application, and then we’ll find a way to allocate these costs to a metric, such as a VM hourly charge.  Having said that, don’t go crazy and turn this into an exercise for the Accounting Department just yet – keep it simple so you can get started.  Then you can refine.

CAVEAT:  I have used hourly costs throughout my example here, because I am building a model that can be easily plugged into vCenter Chargeback Manager’s hourly consumption metering.  If you would rather use a monthly/daily basis, then adjust accordingly.

A Simple/Starter Example

Step 1 – Find your costs

A simple list to start with might look like the following.  
  • Physical servers
  • Network/storage switches
  • Firewalls and other networking/security devices
  • Storage controllers and disks
  • Software licenses
  • Rack space, power and cooling 

Step 2 – When does it cost?

This is important – you need to know what makes you incur these costs.  Let’s take physical servers as a good example to start with.  The core question to ask is “What event or resource causes me to buy another server?”  For most customers, the answer is memory consumption.  When I look across recent deployments, CPU capacity seems to be in plentiful supply, and memory is still the most constrained resource.  So in most cases, a project asking for 16GB memory is going to tip me over into a new server purchase, rather than the 4, 8, or 16 vCPUs that came with the request.

Your environment might be different, but the process will be similar.

Perhaps a physical server costs $20,000, including hardware maintenance over its planned life.  And perhaps this server has 96GB memory installed in it.  So on a resource basis:    
  • Cost of server: $20,000
  • Memory capacity: 96GB  (the most constrained resource)
  • Upfront cost per GB: $20,000/96GB = $208.33 per GB.  

That’s an upfront purchase cost – but some VMs live for 4 months, others for 8 years, so how does this figure?  The answer is to have a time-based rate, such as an Hourly cost, applied against the life of the equipment.

Step 3 – How long does it last?

This is often missed – when will you buy a replacement for this component?  I spoke to a customer last year who charged projects per GB of storage, but only once, ever.  That meant that IT was responsible for the entire cost of a SAN refresh – not a pretty overhead, and again IT just looked like a dead weight every few years.

Going back to our physical server example, perhaps you plan to refresh your servers every 3 years.  That means the costs should be spread over that 3 year life, as below.  Don’t forget leap-years!
  • Cost of server: $20,000
  • Memory capacity: 96GB  (the most constrained resource)
  • Life of server: 3 years (1,095.75 days, or 26,298 hours!)
  • Hourly cost per GB: $20,000 / 96GB / 3 years = $0.007922/GB/hour.  
Fantastic! Now there’s an hourly rate for the cost of a physical server to support a VM using a memory consumption model!  At least, it’s something to start with.  How accurate is it?  Well, let’s work on that.
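
As a minimal sketch, the arithmetic in Steps 2 and 3 can be written in a few lines of Python.  The function name and the sample VM (16GB consumed for a third of a year) are made up for illustration; the $20,000, 96GB and 3 year figures come from the example above.

```python
# Average hours per year, allowing for leap years (365.25 days * 24 hours).
HOURS_PER_YEAR = 365.25 * 24


def hourly_cost_per_gb(server_cost, memory_gb, life_years):
    """Spread a server's purchase cost over its memory and its planned life."""
    life_hours = life_years * HOURS_PER_YEAR
    return server_cost / memory_gb / life_hours


# Figures from the worked example: $20,000 server, 96GB memory, 3 year life.
rate = hourly_cost_per_gb(server_cost=20_000, memory_gb=96, life_years=3)
print(f"${rate:.6f}/GB/hour")  # $0.007922/GB/hour

# Hypothetical VM: 16GB consumed for a third of a year (roughly 4 months).
vm_cost = rate * 16 * (HOURS_PER_YEAR / 3)
print(f"VM cost over its life: ${vm_cost:.2f}")
```

The same rate works for a VM that lives 4 months or 8 years, because the charge simply accumulates for each hour the memory is in use.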

Step 4 – Refining the model

The reality is that cost models can get as complicated as you can tolerate.  Again, don’t go nuts, as usually the main goal here is to find out a relative measure of who is using what amount of resources, and a rough cost impact.  But let’s consider a few more things in our simple model.

In a VMware vSphere environment, there will be clusters of hosts, very often in an “N+1” availability configuration.  That is, resource consumption should only go as high as the cluster can tolerate with one host down, without impacting performance.  Also, we tend not to absolutely fill a server before considering it “full”.  Lastly, VMware resource management is pretty awesome, allowing safe over-commitment, so we can eke out more capacity than the raw consumption or allocation we assume up-front.

These considerations so far boil down to a net impact on actual memory capacity of a server.  Instead of 96GB being “available”, we modify it as below.
  • Raw physical server capacity: 96GB
  • Clustering: 10 hosts (9 + 1, for High Availability)
  • Level of “fullness”: 85%
  • Degree of over-commitment: 20%
  • Adjusted memory capacity: 96GB * (9/10) * 85% * 120% = 88GB
  • Adjusted hourly cost per GB: $20,000/88GB/3 years = $0.008642/GB/hour. 
I raised an important consideration above, which is "consumption or allocation".  The choice depends on whether you are trying to account for used resources (consumption) or requested resources (allocation).  You will need to make some assumptions in either case.  But for internal cost reporting (as opposed to chargeback) I find consumption to be a more accurate representation of "where stuff goes".
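
The capacity adjustment in the list above can be sketched the same way.  This is only an illustration: the function and parameter names are invented here, and the capacity is rounded to a whole GB before computing the rate, which is what the worked figures above appear to do.

```python
# Average hours per year, allowing for leap years.
HOURS_PER_YEAR = 365.25 * 24


def adjusted_capacity_gb(raw_gb, hosts, spare_hosts, fullness, overcommit):
    """Scale raw memory by the HA reserve, the fill target and over-commitment."""
    usable_fraction = (hosts - spare_hosts) / hosts  # 9 of 10 hosts carry load
    return raw_gb * usable_fraction * fullness * (1 + overcommit)


capacity = adjusted_capacity_gb(
    raw_gb=96, hosts=10, spare_hosts=1, fullness=0.85, overcommit=0.20
)
print(f"Adjusted capacity: {capacity:.3f}GB")  # 88.128GB, i.e. about 88GB

# Rounded to a whole GB to match the figure used in the list above.
adjusted_rate = 20_000 / round(capacity) / (3 * HOURS_PER_YEAR)
print(f"${adjusted_rate:.6f}/GB/hour")  # $0.008642/GB/hour
```

Changing any one input (a smaller cluster, a lower fill target, no over-commitment) shows immediately how sensitive the hourly rate is to that assumption.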


Step 5 – Expanding to other costs

So, now do you just rinse and repeat for all the other costs?  Well, actually not for that many.  A good number of the “other” costs will also roll into the cost of a physical server.  How so?  I’m glad you asked – let’s look at some examples below.
  • VMware licensing for vSphere or vCloud Suite
    • Per physical CPU, usually 2 or 4 CPUs/licenses per server.
    • The “life span” of a perpetual software license is tricky to account for.  You may elect to recover the cost of the license over 3 years, even though you won’t be re-buying the license when you refresh the server.  Annual Support and Subscription (or Software Assurance for Windows) has no such complications.
  • Microsoft Windows Server licensing
    • If you’re smart, you’ll be using Datacenter Edition licensing, which is also per physical CPU
  • Blade chassis
    • Will be an added cost to spread over the blade servers that it supports, which may be 8 servers perhaps.
    • However, a chassis may survive 6 years, or 2 server refreshes, so we need to take that into account.
  • Network switches
    • Purchases usually tied to ports being populated, rather than bandwidth.  Again, we can equate this to a per server charge to some degree.
    • Let’s assume 48 ports.  Perhaps 6 ports per server (a false example if using blade servers, of course).  That breaks down to the cost of one switch per 8 servers, for example.
    • The same theory could apply to Fibre Channel switches for storage, if FC based.
    • Switches will probably be refreshed on a different cycle, as per the blade chassis above, so that again needs to be considered.
  • Rack space, power and cooling
    • Again, this can be tied to physical equipment, although it will be a rough estimate.
If I add these all together using some rough numbers out of thin air, I get to $0.023167 per GB of virtual machine memory per hour.  It is interesting that this is nearly 3 times the raw server cost, but this single rate is now able to account for servers, platform licensing, switches and datacentre costs.
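
One way to sketch that roll-up is to give each cost item its own life span and the number of servers that share it, then add up the hourly cost per server.  Every dollar figure below is invented purely for illustration (they are not the numbers behind the $0.023167 rate above), and the `CostItem` class is a hypothetical helper rather than anything from vCenter Chargeback Manager.

```python
from dataclasses import dataclass

# Average hours per year, allowing for leap years.
HOURS_PER_YEAR = 365.25 * 24


@dataclass
class CostItem:
    name: str
    cost: float               # purchase cost in dollars (made-up figures)
    life_years: float         # years until the item is refreshed or re-bought
    servers_sharing: int = 1  # number of servers the item is spread across

    def hourly_cost_per_server(self) -> float:
        """Hourly cost of this item attributed to one server."""
        return self.cost / self.servers_sharing / (self.life_years * HOURS_PER_YEAR)


# Hypothetical inputs: only the server cost, the 8-server sharing and the
# 3 and 6 year lives follow the examples in the text.
items = [
    CostItem("Physical server", 20_000, 3),
    CostItem("Platform licences for one server", 7_000, 3),
    CostItem("Blade chassis", 16_000, 6, servers_sharing=8),
    CostItem("48-port network switch", 12_000, 6, servers_sharing=8),
]

ADJUSTED_GB_PER_SERVER = 88  # adjusted memory capacity from Step 4

server_hourly = sum(item.hourly_cost_per_server() for item in items)
rate_per_gb_hour = server_hourly / ADJUSTED_GB_PER_SERVER

for item in items:
    print(f"{item.name}: ${item.hourly_cost_per_server():.4f}/server/hour")
print(f"Combined rate: ${rate_per_gb_hour:.6f}/GB/hour")
```

Keeping the life span on each item means a chassis or switch that outlives two server refreshes is recovered over its own 6 years rather than the server's 3.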

I can likewise apply this type of approach to the remaining components from our short list above.  Storage per GB is easy enough, but perhaps the reason for buying another shelf of disk is actually I/O performance – some thought is needed to account for IOPS versus space.  You can probably generalize across the environment as a whole to determine the most obvious metric to use, and treat exceptional workloads as just that – exceptions. 

Keeping It Simple

Firewalls and security services are interesting examples of complexity.  The tipping point for expanding a firewall/IDS to the next bigger model, or another node, is probably network throughput or perhaps the number of rules processed.  You can either meter the network I/O of VMs to account for these, or perhaps assume an average of 6 rules per VM and meter on that basis. 

By the time we’re digging around at this level, however, we’re really tweaking a much more detailed model.  The overall accuracy of the model is only going to change in small increments.  There needs to be a decision about how far down this path you progress.  The overall goal is to get a rough idea of cost allocation.  Accurately accounting for absolutely all costs is an exercise for another article, as this would incorporate non-technical costs such as staff labour, facilities, and maintenance activities, for example.

The model above has boiled down to a few simple rates, which you can plug in to vCenter Chargeback Manager, as per the below screenshot.  You now have a method of finding out where your costs have gone, after buying all your new shiny servers and disks, and who in your organization is consuming how much infrastructure.


Driving Behaviour

One of the strong motivators to introduce cost reporting to a business is to help the business understand the consequences of their IT requests.  The model above merely accounts for resources as they are likely to be consumed - again going back to the question of consumption versus allocation.  A savvy business user might notice that CPU allocation and usage are not charged at all.  So we come back to the earlier question posed – why shouldn’t Project A ask for 16 CPUs for all their virtual machines?

On a technical level, we know that uncontrolled over-allocation of vCPUs will start to cause poor performance, from CPU Ready Time particularly.  While the example model I’ve given here is probably enough for cost recovery, there is reason to include a metric to encourage good behavior – such as splitting costs between memory consumption and vCPU allocation.  This will help the IT Department have a fact-based conversation with Project A about reducing its current allocation of CPUs, or asking for less in the first place.  Exactly how you split the cost, or if perhaps it is an overlay/additional cost, is up to each organization to decide.  This ties into a future discussion of the difference between Cost Reporting, as we’ve done here, and Charging Back costs to the business groups.  That is a different topic, as we must consider “cost perception” as well as cost allocation.
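
To make the idea of a split concrete, here is a rough sketch.  The 70/30 split and the 64 vCPUs expected per server are assumptions picked only for illustration, as are the function names; the per-server hourly cost is simply the $0.023167 rate multiplied by the 88GB adjusted capacity from earlier.

```python
def split_rates(server_hourly_cost, memory_share, adjusted_gb, vcpus_per_server):
    """Split a server's hourly cost into a per-GB rate and a per-vCPU rate."""
    if not 0 <= memory_share <= 1:
        raise ValueError("memory_share must be between 0 and 1")
    memory_rate = server_hourly_cost * memory_share / adjusted_gb
    vcpu_rate = server_hourly_cost * (1 - memory_share) / vcpus_per_server
    return memory_rate, vcpu_rate


def vm_hourly_charge(gb_consumed, vcpus_allocated, memory_rate, vcpu_rate):
    """Charge memory on consumption and vCPUs on allocation."""
    return gb_consumed * memory_rate + vcpus_allocated * vcpu_rate


# Per-server hourly cost implied by the combined rate: $0.023167 * 88GB.
SERVER_HOURLY_COST = 0.023167 * 88

# Assumptions for illustration only: 70% of cost recovered via memory,
# 30% via vCPUs, and 64 vCPUs expected to be allocated per server.
memory_rate, vcpu_rate = split_rates(
    SERVER_HOURLY_COST, memory_share=0.70, adjusted_gb=88, vcpus_per_server=64
)

# Two VMs consuming the same 16GB, but asking for very different vCPU counts.
modest = vm_hourly_charge(16, 4, memory_rate, vcpu_rate)
greedy = vm_hourly_charge(16, 16, memory_rate, vcpu_rate)
print(f"4 vCPUs:  ${modest:.4f}/hour")
print(f"16 vCPUs: ${greedy:.4f}/hour")
```

With a split like this, the total recovered per server stays the same when the server is full, but the request for 16 vCPUs now shows up as a visibly higher hourly charge.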

Further Areas For Expansion

This article is long enough already.  So some areas that are left to future write-ups could include:
  • Cost Reporting versus Charging Back to business groups
  • Accounting for the entire IT budget, beyond the technical resources
Have any other thoughts?  Or perhaps you disagree with my approach here?  Please let me know and write a comment below!



2 comments:

  1. Nice work. I like to use the analogy of buying a car. If you were not responsible for the costs then most people would go for a fancy sports car over a 4 door sedan.

    What you describe here is cost visibility. The next big challenge is cost recovery that to be fair is less of an IT issue but something that needs to be addressed through business processes.
