Monday, September 9, 2013

Performance definitions that should exist

When it comes to Cloud Computing, understanding service and performance definitions is a key hurdle, and one that is usually new to organisations looking to make a move to a Private Cloud, or perhaps considering adoption of a Cloud Provider.

When Cloud Computing was first being talked about, and particularly Private Cloud situations, I recall VMware sales teams (including me) illustrating service tier distinctions based on such things as uptime.  For example, tier 1 workloads get vSphere HA, and tiers 2 and 3 do not.  But looking back on this, what was the real operational cost of having vSphere HA enabled?  Certainly nothing to do with human effort.  Maybe some extra capital cost for some redundant hardware, but that should exist in all vSphere environments anyway.  I haven't seen any vSphere implementation that intentionally builds in NO redundant Ethernet/FC links, or host failover capacity.  It just isn't sensible to do this for the mere sake of creating a lesser tier of infrastructure.

The next key candidate for definitions used with tiers of service was performance, which is the topic of this blog article.  This implied an expensive and high performance tier, versus a cheaper but lower performance tier.  

The example representation of these performance tiers was storage types.  That is, tier 1 applications would be put on fast disks, maybe some nice 15K RPM Fibre Channel drives, and tier 2 applications would be put on relatively slower, cheaper disks such as some 7200 RPM SATA drives.  OK, that sounds somewhat justifiable.  But what happens when the fast disk is hopelessly over-committed (because, in the absence of an effective cost model, everyone selected the "best" levels!) and the tier 1 applications get terrible performance?  Meanwhile, the "cheap" SATA drives might have been under-used, or lucky enough to host undemanding applications, and performed just as well as the tier 1 disks.  Whoops.  Something is clearly missing from these common examples of performance definitions.

Here's my proposal: define performance in terms of guarantees of response from the infrastructure.  Specifically, what I mean is that when an application needs to perform disk I/O, it gets a certain response time.  When it needs access to CPU or memory, it does or doesn't have to wait/contend for these resources.  Putting this into "VMware engineer" speak, or my version of it, the performance definition might contain terms such as the list below.  Obviously, the actual numbers here are fictitious and would need to reflect a real SLA.

CPU  


  • Maximum/peak speed guaranteed = 2.2GHz.  This would be max vCPU speed, matched or bettered by the underlying hardware or CPU Limit imposed by vSphere.
  • Reserved speed guarantee = 10% of configured.  This would be the CPU Reservation imposed by vSphere, and in theory is the minimum kept available for the VM as a whole, across all vCPUs.
  • CPU Ready time guaranteed to be below 100 milliseconds, as an average for each 20 second period.  This one is the real trick, in my view.  You can guarantee all the CPU Reservation you like, but if CPU Ready is holding high at some crazy value like 3000ms, the application will suck.  This can be a tricky one to back up, because there are many factors that contribute to CPU Ready time, not least the number of multi-vCPU workloads created by the tenant/user that are competing for co-scheduled CPU time, which is perhaps out of the hands of the infrastructure provider.  Maintaining this service level would be a combination of carefully monitoring the over-commitment of vCPUs, and how many 2, 4, 8 or more vCPU VMs are supported in a cluster.  It would be prudent to prohibit certain sizes of VM for a given service, but perhaps this is truly just something that needs monitoring and reporting by the Operations team.
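To make the CPU Ready guarantee concrete, here is a minimal sketch of how an Operations team might evaluate it.  vCenter's cpu.ready.summation counter reports Ready time in milliseconds accumulated per 20-second sample period; the sample values and the helper names below are invented for illustration, not a real vSphere API.

```python
# Hypothetical check of the CPU Ready guarantee. The SLA number and
# sample data are the fictitious figures from the text above.

READY_SLA_MS = 100  # guaranteed max average CPU Ready per 20s period

def ready_breaches(samples_ms, sla_ms=READY_SLA_MS):
    """Return the per-period Ready samples that breach the guarantee."""
    return [s for s in samples_ms if s > sla_ms]

def ready_percent(sample_ms, period_ms=20_000):
    """Convert a per-period Ready value to the familiar %RDY figure."""
    return 100.0 * sample_ms / period_ms

samples = [40, 85, 120, 3000, 60]     # ms of Ready per 20s sample
print(ready_breaches(samples))        # [120, 3000]
print(ready_percent(3000))            # 15.0 -> waiting 15% of the time
```

The %RDY conversion is why a "crazy value like 3000ms" matters: it means the VM spent 15% of the period waiting for a physical CPU.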

Memory


  • Reserved memory guarantee = 50% of configured.  Ensures that at least this much physical memory is present for the VM.  This one is reasonably uncomplicated, and imposed by vSphere for the VM.
  • Hypervisor maximum swap rate = 400 pages/second.  By this, I mean the disk swapping of VM memory handled by the hypervisor (ESXi), not the hypervisor's own memory, and not the disk swapping performed by the VM's guest operating system.  This might be worth further consideration, but it is a way to ensure that even though not all memory is guaranteed, the platform will have capacity to support the workloads as a whole without grinding to a halt.  My motivation for including this comes from extrapolating the Memory Reservation being pushed to its defined limits.  Let's say that memory is indeed reserved for each workload at 50%, but the service provider only provisions exactly enough memory to meet that definition.  In other words, the rest of the memory space for VMs requires ESXi swapping to disk.  Now, it should be clear that the performance will be pitiful, and all tenants/users will rise up in revolt - but against what?  The provider has met their guarantees!  They didn't say your application was guaranteed to perform well, and how could they for your complicated application?  This is a harsh situation if the infrastructure provider is internal to the business and the tenant can't just jump ship.  The tenant WILL find a way to go elsewhere - so it must be fixed.  Anyhow, expressing a maximum swap rate is one way to provide a guarantee that, while the unreserved memory may be under contention across the workloads, it will only be up to a point.  This is enforced by the Operations team making the right choices for memory over-commitment on the platform, and monitoring it.  An early warning sign would be when Memory Ballooning activity starts to rise - ESXi is asking the VM guest OS for help!
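The swap-rate guarantee above can be sketched as a simple counter-delta calculation.  ESXi does expose per-VM swap activity counters (e.g. swap-in/swap-out rates); the figures and function below are made up to illustrate the 400 pages/second ceiling, not drawn from a real environment.

```python
# Hypothetical swap-rate check against the fictitious SLA ceiling
# from the text. Counter readings are invented for illustration.

SWAP_SLA_PAGES_PER_SEC = 400

def swap_rate(pages_start, pages_end, seconds):
    """Average pages/second between two cumulative counter readings."""
    return (pages_end - pages_start) / seconds

rate = swap_rate(pages_start=10_000, pages_end=22_000, seconds=20)
print(rate)                             # 600.0 pages/second
print(rate > SWAP_SLA_PAGES_PER_SEC)    # True -> guarantee breached
```

In practice the Operations team would watch this trend alongside Memory Ballooning, since ballooning rising is the early warning that swapping is next.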

Storage performance


  • Maximum disk operations per second = 400 per VM.  This is to ensure that there is a limit to noisy neighbours, and just good practice to ensure consistency of service.  Otherwise, early adopter customers will be disappointed when their initial blindingly-fast speed is diminished to "regular" levels as more customer workloads come on board.  vSphere Storage I/O Control can step in here.
  • Minimum guaranteed disk operations per second = 25 per VM.  This is just like CPU guarantees - this ensures that a VM will be at least able to hobble along with a certain amount of throughput.
  • Maximum storage latency guaranteed to be below 80 milliseconds, as an average over each 20 second period.  Again, this one for me is the real trick.  I have come across a few terrible environments over the years (storage vendors to remain unnamed!) which struggled desperately with their VMware environment because of storage latency.  Storage throughput is one thing, but if that throughput is happening behind a veil of sluggish response, the application will suck.  Think of it as akin to your Internet speeds.  You might have an amazing 150Mb/s sitting at your home office (I wish), but terrible response time (pings).  Great for watching buffered videos, but if you're playing Call of Duty you're going to get hammered.  OK, perhaps not the most business-relevant example.  How about this - if your atom-smasher application is producing gigabytes of output data for storage, high throughput is fine.  But if your database on an accompanying VM is doing lots of small operations that are each slowed down by poor disk latency, the database will suck, regardless of the throughput available.
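The three storage numbers above can be evaluated together per VM, per sample period.  This is an illustrative sketch with invented samples; the thresholds simply mirror the fictitious SLA figures in the bullets, and the function name is my own.

```python
# Hypothetical per-VM storage SLA evaluation, using the example
# thresholds from the text (all figures fictitious).

MAX_IOPS, MIN_IOPS, MAX_LATENCY_MS = 400, 25, 80

def evaluate_storage(iops, latency_ms):
    """Return human-readable SLA findings for one 20s sample."""
    findings = []
    if iops > MAX_IOPS:
        findings.append("IOPS cap exceeded (noisy neighbour)")
    if iops < MIN_IOPS:
        findings.append("below guaranteed minimum IOPS")
    if latency_ms > MAX_LATENCY_MS:
        findings.append("latency guarantee breached")
    return findings

print(evaluate_storage(iops=550, latency_ms=95))
# ['IOPS cap exceeded (noisy neighbour)', 'latency guarantee breached']
print(evaluate_storage(iops=200, latency_ms=12))   # [] -> within SLA
```

Note that the IOPS cap could be enforced technically (vSphere Storage I/O Control), while the latency guarantee is largely something the Operations team must monitor and design for.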

Storage space


  • All configured storage will be available.  This might be needed to ensure that the tenant is not left high and dry because the service provider forgot to manage their thin provisioning in time!
  • This is also where broader service levels would be described around data backup and/or replication, data retention, and data recovery times.  In addition, if the context is an external Cloud Provider, you would also define data confidentiality, data erasure on VM deletion and/or service termination (a critical one, but usually absent), and data access by third parties.  It is worth clearly defining such things as legal jurisdiction, as the Australian Federal Police (and similar bodies in all countries) ultimately still have power to confiscate data in justified circumstances.  Department of Homeland Security and Anti-Terrorism laws didn't really change anything here, sorry guys.
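The "all configured storage will be available" guarantee amounts to a thin-provisioning headroom check: every thin disk must be able to grow to its full configured size.  A trivial sketch, with invented figures and a function name of my own choosing:

```python
# Hypothetical thin-provisioning check: can the datastore absorb all
# configured (thin) storage being fully consumed? Figures invented.

def thin_overcommit_ok(datastore_capacity_gb, configured_gb):
    """True if all configured storage could actually be delivered."""
    return configured_gb <= datastore_capacity_gb

print(thin_overcommit_ok(datastore_capacity_gb=10_000, configured_gb=12_500))
# False -> tenants could be left high and dry
```

Real environments would allow deliberate over-commitment with alerting thresholds rather than a hard equality, but the SLA wording above implies the strict version.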

Please keep in mind that this is all in support of defining differentiated performance levels.  The point of defining any of the above numbers is to create a performance expectation (or even guarantee), sure, but also to enable an infrastructure provider to express the precise difference between what is meant by "Gold" and "Silver" services, for instance.  Thus, one set of performance numbers would be appropriate for the better service level, and a different set of numbers might support the lesser service level.

And don't forget, this is not necessarily about external service providers - all this should apply equally well to a business's own internal IT practice.  In fact, the business should first have some understanding of these numbers in order to understand what an external provider is offering, and how suitable the offering would be for the business.  How else can the business know whether it is going to win or lose by changing to or between service providers?  Price?  Bah.  The last thing I need is a bunch of unscrupulous cheap cloud providers creating a poor reputation for "going to the cloud".

I need to point out that some of these performance definitions can be controlled using technical platform features, and some can't - or might be difficult.  vSphere has some fantastic capabilities in all these areas, and these should be used to create a well understood performance tier.  Some of these capabilities are exposed to higher-level products such as vCloud Director and vCloud Automation Center - but certainly not all.  This is where a clever IT practice will need to determine which can be integrated as part of the service, and perhaps even orchestrated using vCenter Orchestrator or similar mechanisms.  Other aspects of these guarantees will not be implemented as technical controls - CPU Ready time being one example.  It is just going to require a switched-on Operations team that is aware of what they are managing, why they are managing those numbers, how they gain visibility to these things, and what to do about it when numbers are breached or under threat of a breach.

That's my quick list of some considerations for performance definitions that I think are glaring in their absence from service tier discussions.  Perhaps they exist, and I would love to hear from you if I have overlooked you or your favourite service provider.  In addition, I have only come up with my own thoughts on some very important numbers and considerations - but I am certainly no VCDX.  I am sure I have missed a bunch of additional metrics that are just as important, or perhaps supersede my suggestions.  Or perhaps, you disagree with what I have proposed above.  In any case, please raise your voice - either in comments below, or ping me a note!