Monday, August 25, 2014

Report on View session history - How busy is my lab?

I was asked a while ago to help justify spending some money on our shared lab environment, which we use for customer demonstrations.  The question really was "How much do people actually use the demonstration lab?"  So I thought there must be a way within VMware View to help figure it out.  There was, and it turned into a pretty little graph, as I'll show you below.

The first part is to extract the session data.  View's event database (called "View_Events" in my environment) logs all sorts of things about the environment (read the KB article here).  For my purposes, I noticed it logged an event when a user connected, and another when they disconnected.  I didn't care about how long people stayed logged in, because most of us use the lab for short, sharp demos: we tend to leave our desktops logged in all the time, quickly jump in to do a demonstration, and then jump off again.  It was only the ACTIVE session count and length of stay that I wanted to know. In particular:
  • How frequently were people connecting to the lab, and
  • How long were people in session for (quick demo, longer demo, or working on something bigger like an all-day marketing event)
There is no single field showing session duration, but it can be derived because the same session ID is attached to both the connect and disconnect events.  These show up as login/logout events on the broker - "BROKER_USERLOGGEDIN" and "BROKER_USERLOGGEDOUT". Have a long look at my query below, and it'll make sense when you compare it with the results.
USE View_Events
SELECT LoginEvents.UserDisplayName AS Username,
       LoginEvents.EventID AS LoginEventID,
       LoginEvents.Time AS TimeIn,
       LogoutEvents.EventID AS LogoutEventID,
       LogoutEvents.Time AS TimeOut,
       LoginData.StrValue AS SessionId,
       LogoutEvents.Time - LoginEvents.Time AS SessionTime
FROM VE_user_events_hist AS LoginEvents
     INNER JOIN VE_event_data_historical AS LoginData ON LoginEvents.EventID = LoginData.EventID
     INNER JOIN VE_event_data_historical AS LogoutData ON LoginData.StrValue = LogoutData.StrValue
     INNER JOIN VE_user_events_hist AS LogoutEvents ON LogoutData.EventID = LogoutEvents.EventID
WHERE (LoginEvents.Module = N'Broker') AND (LoginEvents.EventType = N'BROKER_USERLOGGEDIN')
  AND (LoginData.Name = N'BrokerSessionId') AND (LogoutData.Name = N'BrokerSessionId')
  AND (LogoutEvents.Module = N'Broker') AND (LogoutEvents.EventType = N'BROKER_USERLOGGEDOUT')
ORDER BY LoginEvents.Time DESC
If you are good at reading SQL, you might notice the last column selected, which I call "SessionTime".  This is a SQL calculation of logout time minus login time, which tells me how long the person was connected for.

The results look a bit like the below, when extracted as raw CSV.
MELB\nwheat,28366,2013-11-20 21:47:39.513,28370,2013-11-20 22:07:45.483,e2fc4503_1633_4634_8e3b_bdd8dc098438,1900-01-01 00:20:05.970

Putting it through the Excel wringer, I turned it into something a LOT more palatable, as below.  I also added some further calculated fields, which helped me turn it into a pretty PivotChart. The bolded fields are the ones I used for my report.
  • Username
    • I performed a find and replace to remove the unneeded 'DOMAIN\' part.
  • LoginEvent
    • Completely ignored field, but is the ID for the BROKER_USERLOGGEDIN event.
  • LoginTime
    • Event timestamp, ignored hereafter.
  • LogoutEvent
    • Completely ignored field, but is the ID for the BROKER_USERLOGGEDOUT event.
  • LogoutTime
    • Event timestamp, ignored hereafter.
  • SessionID
    • This magic field is present for both the Login and Logout event and connects the login/logout events together so I can calculate the session duration!
  • SessionTime
    • This is the field calculated in the SQL query.  Unfortunately, it comes through as a timestamp, which Excel displays as "one day plus the calculated time", so I convert it below.
  • SessionLength
    • Added in Excel to remove the additional "day" in the timestamp above.
    • Formula is {"=[@SessionTime]-1"}
  • Short
    • Added in Excel, to be "1" if the SessionLength is less than 30 minutes (1 day / 48)
    • Formula is {"=IF(([@SessionLength]<(1/48)),1,0)"}
  • Medium
    • Added in Excel, to be "1" if the SessionLength is 30 minutes or more but less than 90 minutes.
    • Formula is {"=IF((AND([@SessionLength]>=(1/48),[@SessionLength]<(1/16))),1,0)"}
  • Long
    • Added in Excel, to be "1" if the SessionLength is 90 minutes or more.  (A scripted version of this short/medium/long bucketing is sketched just after this list.)
    • Formula is {"=IF(([@SessionLength]>=(1/16)),1,0)"}
  • Month
    • Added in Excel, month extracted for my PivotChart later.
    • Formula is {"=(MONTH([@LoginTime]))"}
  • Year
    • Added in Excel, year extracted for my PivotChart later.
    • Formula is {"=(YEAR([@LoginTime]))"}

Phew!  That was a bunch of playing around, and surely someone could quickly make a template or macro out of it.  But seeing as I only need to churn out a report every 6 months or so, I haven't bothered as yet.  The resulting table looks like the below.

Username LoginEvent   LoginTime LogoutEvent LogoutTime SessionID SessionTime SessionLength Short Medium Long Month Year
asingleton 1794 30/6/14 10:58 PM 1796 30/6/14 11:07 PM 7f6bfad3_7000_4d69_88d2_1c17e7276e02 24:08:34 0:08:34 1 0 0 6 2014
gorchard 1790 30/6/14 8:49 PM 1800 1/7/14 12:36 AM ceb28d1a_8bad_47cb_80c4_a793e9dc42ce 27:47:22 3:47:22 0 0 1 6 2014

This told me everything I needed to know, but it's not really in "Management language" - by which I mean a pretty graph!  The last part is to quickly turn this into a PivotChart using the Excel wizards.  The only difference is that I interpret "short" sessions as "Quick demo", "medium" as "Full demo", and "long" as "Event or workshop".

The pretty output of my report is shown below.  Vertical axis is session count, and horizontal axis is the first six months of this year.



This did in fact result in some money being spent on our environment, partly because I was able to show the usage frequency and indicate what kinds of things people are doing in our View environment.  I hope you've found this useful, and that you can do this directly in your own environment, or enhance the report even further with a bit of tinkering.  Please contact me if you'd like the sample file that goes with this (although I'm sure you can re-create it), or if you have a better version you'd like to contribute back!

Monday, January 20, 2014

VCAP-CID and VCAP-DCD exam experiences

I recently studied for and passed the VMware Certified Advanced Professional (VCAP) level exams for both Cloud Infrastructure Design (VCAP-CID) and Data Center Design (VCAP-DCD).  It was an interesting experience, and well worth the process for anyone considering them.

I was part of a study group with @GrantOrchard and @Josh_Odgers, and those guys really helped me focus on a study structure.  In both exams, our general approach was to work through the Exam Blueprint, which has been the strong recommendation by VMware Education and pretty much everyone else who cares to comment!  I can validate that approach.  Using the Blueprint, you know exactly what you're going to be examined on, and it will draw your attention to your weaker areas.

My natural behaviour was to skim through the blueprint and ummm and ahhh at each section, thinking about what it meant.  In my more productive moments, I would feel a little nervous and be prompted into a sideline of reading the Best Practice papers for the area.  When getting into the group, however, this casual attitude was firmed up into some very useful whiteboarding exercises.  Some of the exercises were drawing up a table - like a permission or role matrix.  This was the content which I just had to memorise, as the "logic" you might use in your own workplace could be irrelevant for the exam.  The exam is looking for certain organisational roles, and choices of permission management, which may have no bearing on how YOU would do it in real life.

For the VCAP-CID, almost everything you need to know is in the vCAT (plus a little bit of Chargeback docs).

The absolute BEST exercises were modelling the design scenarios, of which there are a few in each exam.  These take some time to answer, and are not always easy to draw in the exam, so having them clear in your mind is a great start.  For each of the study areas, we would try to imagine "What would be a design exercise for this?" - and then try to draw out a scenario.  This was awesome in a group, because inevitably it would start a discussion or argument about exactly what might be asked, and exactly what a "good answer" might contain.  For me, this is where the rubber hit the road.  Once we had a good set of scenarios we had worked through, we would then be in a good position to wonder how it might expand/change, or if we were missing a scenario so far.

My VCAP Design study tips


  • Find some colleagues, and study in a group.  If you're semi-confident, then a minimum of 5-6 sessions would be a good idea.  
  • Follow the Exam Blueprint as your study blueprint.  It tells you what you will be asked.  Pore over it and make sure you could answer questions about all the areas.
  • If you're feeling weak in areas, download the Best Practice whitepaper for the topic and make sure you understand the content, and why recommendations are given.  I found the below ones the most useful for me.
  • Get a whiteboard.  A big one!
  • For each study area, pick two design scenarios you think might come up and work through them in the group.  Nut it out, argue about it, ask what else they might want.  Don't forget you'll be starting from Business Requirements, so that has to be the start of the scenario!

Are you ready?

Well, that is something you will only know AFTER the exam!  However, my guidance on the exams is below.  Do take it with a grain of salt, as everyone will have a different experience.

VCAP-DCD

In my opinion, this exam was quite good, and not particularly scary.  Trying to sit down and focus for 3.5 hours is the biggest problem for candidates.  I reckon I was mentally "done" by about 2 hours into it.  So a VERY good sleep the night before is recommended - don't sit up studying late the night before.  It will hurt you.

Anyone who has a VCP-DV has the vSphere technical knowledge, I think.  In addition to this, the rest of the knowledge comes from using that feature set with customers' real problems.  I would say that if you have spent the last 2-3 years implementing vSphere or acting in a (modestly detailed) technical pre-sales capacity for vSphere solutions, then you should be right to go.

I have recommended that ALL my VMware SE colleagues just go for it.  Both @GrantOrchard and @DemitasseNZ recommended that I just book it without study.  While I did study beforehand, it probably didn't help much - they were right.

VCAP-CID

Now, this one is a different kettle of fish.  This exam was challenging in more ways than one.  I believe it is perhaps still a little new, in the sense that the exam content is nowhere near as refined/evolved as the DCD exam.  I found quite a few of the questions were either (a) unanswerable because of vagueness or ambiguity, or (b) unanswerable because the answers presented all seemed wrong.  I am happy to trust that the exam could be 100% correct and I am a bit thick, but for some questions I was still unable to derive a correct answer even several days later, when thinking back on it.

The other challenge was that at least one of the design scenarios seemed to be broken. That is, I couldn't correctly connect up the elements, no matter what I tried.  Perhaps I was marked correctly anyway, but I doubt it - and the drawing tool just would not play ball.

I failed my first attempt at this one (VCAP-CID), and so I can verify that it was not a one-off problem.  I also noticed that the question pool must be quite small, so unfortunately my second sitting was probably a bit unfair as I found myself in front of a lot of familiar (and previously considered) questions.

Failure?

My biggest factors in failing the VCAP-CID exam the first time around were two that will be familiar to other candidates:

  • Lack of sleep the night before.  I had trouble sleeping due to an unrelated event, so while I got to bed early, I spent a lot of time listening to the night pass by!  This meant my brain was struggling to concentrate, and most questions took 2-3 reads before I understood what was needed.
  • Time management.  I ran out of time.  In fact, on a "pro-rata" basis, I got about the same score on both attempts, for the questions I got to.  The first time I got nearly three quarters through, and just failed by a few points.  The second attempt I finished quite early, and passed pretty well.  I thought I was managing OK the first time, but obviously my brain was operating at entirely too slow a rate (due to the first factor above)!

Conclusion

If you've been a "vSphere guy/gal" for a while, and kept your VCP status up to date, just do the VCAP-DCD.  It's a fine exam, and is a good test of what you're probably doing day to day anyway.

If you attempt the VCAP-CID, be prepared for a poorer quality exam, and a lower achievable score (I think).  And keep the vCAT close...

Good luck!



Monday, September 9, 2013

Performance definitions that should exist

When it comes to Cloud Computing, understanding service and performance definitions is a key hurdle, and one that is usually new to organisations looking to make a move to a Private Cloud, or perhaps considering adoption of a Cloud Provider.

When Cloud Computing was first being talked about, and particularly Private Cloud situations, I recall VMware sales teams (including me) illustrating service tier distinctions based on such things as uptime.  For example, tier 1 workloads get vSphere HA, and tiers 2 and 3 do not.  But looking back on this, what was the real operational cost of having vSphere HA enabled?  Certainly nothing to do with human effort.  Maybe some extra capital cost for some redundant hardware, but that should exist in all vSphere environments anyway.  I haven't seen any vSphere implementation that intentionally builds in NO redundant Ethernet/FC links, or no host failover capacity.  It just isn't sensible to do this for the mere sake of creating a lesser tier of infrastructure.

The next key candidate for definitions used with tiers of service was performance, which is the topic of this blog article.  This implied an expensive and high performance tier, versus a cheaper but lower performance tier.  

The example representation of these performance tiers was storage types.  That is, tier 1 applications would be put on fast disks, maybe some nice 15K RPM Fibre Channel drives, and tier 2 applications would be put on relatively slower, cheaper disks such as some 7200 RPM SATA drives.  OK, that sounds somewhat justifiable.  But what happened if the fast disk was hopelessly over-committed (because, in the absence of an effective cost model, everyone selected the "best" level!) and the tier 1 applications got terrible performance?  Meanwhile, the "cheap" SATA drives might have been under-used, or lucky enough to host non-demanding applications, and performed the same as the tier 1 disks.  Whoops.  Something is clearly missing from these common examples of performance definitions.

Here's my proposal: define performance in terms of guarantees of response from the infrastructure.  Specifically, what I mean is that when an application needs to perform disk I/O, it gets a certain response time.  When it needs access to CPU or memory, it does or doesn't have to wait/contend for these resources.  Putting this into "VMware engineer" speak, or my version of it, the performance definition might contain terms such as the list below.  Obviously, the actual numbers here are fictitious and would need to reflect a real SLA.

CPU  


  • Maximum/peak speed guaranteed = 2.2GHz.  This would be max vCPU speed, matched or bettered by the underlying hardware or CPU Limit imposed by vSphere.
  • Reserved speed guarantee = 10% of configured.  This would be the CPU Reservation imposed by vSphere, and in theory is the minimum kept available for the VM as a whole, across all vCPUs.
  • CPU Ready time guaranteed to be below 100 milliseconds, as an average for each 20 second period.  This one is the real trick, in my view.  You can guarantee all the CPU Reservation you like, but if CPU Ready is holding high at some crazy value like 3000ms, the application will suck.  This can be a tricky one to back up, because there are many factors that contribute to CPU Ready time, not least the number of multi-vCPU workloads created by the tenant/user that are competing for co-scheduled CPU time, which is perhaps out of the hands of the infrastructure provider.  Maintaining this service level would be a combination of carefully monitoring the over-commitment of vCPUs, and how many 2, 4, 8 or more vCPU VMs are supported in a cluster.  It would be prudent to prohibit certain sizes of VM for a given service, but perhaps this is truly just something that needs monitoring and reporting by the Operations team (a rough sketch of that kind of check follows this list).
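
To make the CPU Ready guarantee concrete, here is a minimal sketch of the kind of check an Operations team might script.  It assumes you have already pulled the CPU Ready figures (in milliseconds per 20-second interval) out of your monitoring tool - the function name and the sample data are mine for illustration, not a particular vSphere API.

    // Hypothetical check: does the average CPU Ready per 20-second sample
    // breach the guaranteed threshold (e.g. 100 ms)?
    function cpuReadyBreached(readySamplesMs, thresholdMs) {
        var total = 0;
        for (var i = 0; i < readySamplesMs.length; i++) {
            total += readySamplesMs[i];
        }
        return (total / readySamplesMs.length) > thresholdMs;
    }

    // Example: five 20-second samples for one VM, in milliseconds
    var samples = [40, 85, 120, 60, 95];
    if (cpuReadyBreached(samples, 100)) {
        // alert the Operations team, raise a ticket, and so on
    }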

Memory


  • Reserved memory guarantee = 50% of configured.  Ensures that at least this much physical memory is present for the VM.  This one is reasonably uncomplicated, and imposed by vSphere for the VM.
  • Hypervisor maximum swap rate = 400 pages/second.  By this, I mean the disk swapping of VM memory handled by the hypervisor (ESXi), not the hypervisor's own memory, and not the disk swapping performed by the VM's guest operating system.  This might be worth further consideration, but it is a way to ensure that even though not all memory is guaranteed, the platform will have capacity to support the workloads as a whole without grinding to a halt.  My motivation for including this comes from extrapolating what happens when the Memory Reservation is pushed to its defined limits.  Let's say that memory is indeed reserved for each workload at 50%, but the service provider only provisions exactly enough memory to meet that definition.  In other words, the rest of the memory space for VMs requires ESXi swapping to disk.  Now, it should be clear that the performance will be pitiful, and all tenants/users will rise up in revolt - but against what?  The provider has met their guarantees!  The provider never said your application was guaranteed to perform well - and how could they, for your complicated application?  This is a harsh situation if the infrastructure provider is internal to the business and the tenant can't just jump ship.  The tenant WILL find a way to go elsewhere - so it must be fixed.  Anyhow, expressing a maximum swap rate is one way to provide a guarantee that, while the unreserved memory may be under contention across the workloads, it will only be contended up to a point.  This is enforced by the Operations team making the right choices for memory over-commitment on the platform, and monitoring it.  An early warning sign would be when Memory Ballooning activity starts to rise - ESXi is asking the VM guest OS for help!

Storage performance


  • Maximum disk operations per second = 400 per VM.  This is to ensure that there is a limit to noisy neighbours, and just good practice to ensure consistency of service.  Otherwise, early adopter customers will be disappointed when their initial blindingly-fast speed is diminished to "regular" levels as more customer workloads come on board.  vSphere Storage I/O Control can step in here.
  • Minimum guaranteed disk operations per second = 25 per VM.  This is just like CPU guarantees - this ensures that a VM will be at least able to hobble along with a certain amount of throughput.
  • Maximum storage latency guaranteed to be below 80 milliseconds, as an average over each 20 second period.  Again, this one for me is the real trick.  I have come across a few terrible environments over the years (storage vendors to remain unnamed!) which struggled desperately with their VMware environment because of storage latency.  Storage throughput is one thing, but if that throughput is happening behind a veil of sluggish response, the application will suck.  Think of it akin to your Internet speeds.  You might have amazing 150Mb/s sitting at your home office (I wish), but terrible response time (pings).  Great for watching buffered videos, but if you're playing Call of Duty you're going to get hammered.  OK, perhaps not the most business-relevant example.  How about this - if your atom-smasher application is producing gigabytes of output data for storage, high throughput is fine.  But if your database on an accompanying VM is doing lots of small operations that are each slowed down by poor disk latency, the database will suck, regardless of the throughput available.

Storage space


  • All configured storage will be available.  This might be needed to ensure that the tenant is not left high and dry because the service provider forgot to manage their thin provisioning in time!
  • This is also where broader service levels would be described around data backup and/or replication, data retention, and data recovery times.  In addition, if the context is an external Cloud Provider, you would also define data confidentiality, data erasure on VM deletion and/or service termination (a critical one, but usually absent), and data access by third parties.  It is worth clearly defining such things as legal jurisdiction, as the Australian Federal Police (and similar bodies in other countries) ultimately still have the power to confiscate data in justified circumstances.  Department of Homeland Security and Anti-Terrorism laws didn't really change anything here, sorry guys.

Please keep in mind that this is all in support of defining differentiated performance levels.  The point of defining any of the above numbers is to create a performance expectation (or even a guarantee), sure, but also to enable an infrastructure provider to express the precise difference between what is meant by "Gold" and "Silver" services, for instance.  Thus, one set of performance numbers would be appropriate for the better service level, and a different set of numbers might support the lesser service level.

And don't forget, this is not necessarily about external service providers - all this should apply equally well to a business's own internal IT practice.  In fact, the business should first have some understanding of these numbers in order to understand what an external provider is offering, and how suitable the offering would be for the business.  How else can the business know whether it is going to win or lose by changing to or between service providers?  Price?  Bah.  The last thing I need is a bunch of unscrupulous cheap cloud providers creating a poor reputation for "going to the cloud".

I need to point out that some of these performance definitions can be controlled using technical platform features, and some can't - or might be difficult.  vSphere has some fantastic capabilities in all these areas, and these should be used to create a well understood performance tier.  Some of these capabilities are exposed to higher-level products such as vCloud Director and vCloud Automation Center - but certainly not all.  This is where a clever IT practice will need to determine which can be integrated as part of the service, and perhaps even orchestrated using vCenter Orchestrator or similar mechanisms.  Other aspects of these guarantees will not be implemented as technical controls - CPU Ready time being one example.  It is just going to require a switched-on Operations team that is aware of what they are managing, why they are managing those numbers, how they gain visibility of these things, and what to do when numbers are breached or under threat of a breach.

That's my quick list of some considerations for performance definitions that I think are glaring in their absence from service tier discussions.  Perhaps they exist, and I would love to hear from you if I have overlooked you or your favourite service provider.  In addition, I have only come up with my own thoughts on some very important numbers and considerations - but I am certainly no VCDX.  I am sure I have missed a bunch of additional metrics that are just as important, or perhaps supersede my suggestions.  Or perhaps, you disagree with what I have proposed above.  In any case, please raise your voice - either in comments below, or ping me a note!

Sunday, August 18, 2013

Charging your cloud customer costs they want to see

In an earlier blog, I described a simple way to start building a cost model based on infrastructure cost recovery.  This approach tried to account for costs actually incurred, so that you could have a good idea of who is using your infrastructure budget, and maybe have a basis for charging them.  Obviously, this is what all service providers need to understand, but it is also something that many IT practices are starting to tackle.

The problem with that approach, however, is that if you passed the resulting costs onto a customer based on the "actual cost" idea, you would be spending a lot of time arguing with your customer.  The customer (whether internal business or an external customer) would query how the hell they ended up with a bill of $5,127.63 this month, when last month it was $3,502.24.   This is a good lesson I learned from talking to service providers.  Customers are very gun-shy of the "shocking mobile phone bill" event.  Anyone paying for an IT service is VERY keen for it to be predictable, consistent from month to month, and easily understood.

So, how to marry these two ideas?  On the one hand, the provider side of the service has definite costs, and those costs need to be accurately tracked and apportioned out to the users of that service.  And on the other hand, the customer doesn't want to end up with highly fluctuating costs, unpredictable bills, and NEVER to receive some huge bill they can't explain.  The "inexplicable bill" is a problem, even when a customer is willing to deal with the varying bills.  What I mean by that is, when a bill comes and the customer queries it, how would the provider explain how those particular numbers got calculated?  Do you imagine a conversation where the customer is interested in how many CPU cycles were consumed, Bytes transferred, memory consumed on physical server, etc?  I can imagine a shouting match full of accusations, doubts, lack of trust, and ultimately an unhappy experience.

The insight from a few of my service provider friends was "the simple cost model".  I guess it's nothing magic - but it makes good sense as part of a "dual cost model".  One model for internal validation of real costs incurred.  The other for something you can confidently invoice, explain and back up with simple data.

Take a cursory look around a few IaaS provider price lists, and you'll find some easy examples.  For instance, $35/month for a virtual machine of a standard size.  Other costs might include certain specific resources added to the VM configuration, like another vCPU or disk, or a larger memory size.  I think the key to making this work simply is to limit the sizing choices available to the customer, so that prices increment in only a small number of known ways.  Then it is easy to look at a list of customer VMs and, without the aid of a spreadsheet or detailed metrics, quickly work out the cost of that setup.  Thus, a customer knows what they're up for, can understand an invoice when it arrives, AND can have a conversation with the provider about it without it turning into a frustrated screaming match!
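
To show how simple that outward-facing model can be, here is a minimal sketch.  The catalogue and the prices are made up (only the $35/month base comes from the example above), but the point is that anyone can reproduce the invoice by eye - no metering data required.

    // Hypothetical simple price list: a flat base price per VM per month,
    // plus a small number of known increments the customer can choose from.
    var priceList = {
        baseVmPerMonth: 35,     // standard-size VM
        extraVcpu: 10,          // per additional vCPU
        extraGbRam: 5,          // per additional GB of memory
        extraDisk100Gb: 8       // per additional 100 GB disk
    };

    function monthlyCharge(vm) {
        return priceList.baseVmPerMonth
             + (vm.extraVcpus * priceList.extraVcpu)
             + (vm.extraGbRam * priceList.extraGbRam)
             + (vm.extraDisks * priceList.extraDisk100Gb);
    }

    // Example customer with two VMs - easy to verify by eye
    var vms = [
        { name: "web01", extraVcpus: 1, extraGbRam: 2, extraDisks: 0 },   // 35 + 10 + 10 = 55
        { name: "db01",  extraVcpus: 2, extraGbRam: 4, extraDisks: 1 }    // 35 + 20 + 20 + 8 = 83
    ];
    var invoiceTotal = 0;
    for (var i = 0; i < vms.length; i++) {
        invoiceTotal += monthlyCharge(vms[i]);
    }
    // invoiceTotal is 138 - the kind of number a customer can check themselves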

Now, I did mention before about how to marry this "outward cost" with the accurate internal cost.  This is a reconciliation exercise that should be done on a regular basis, to ensure that whatever the customers are being charged is as close as possible to the real costs incurred (and maintaining whatever profit margins to go with it).  You might find that the "incurred cost + margin" and the "invoiced revenue" would vary by 10% either way from month to month - but as long as the two numbers gravitate near each other over time, you are winning!  And as variances become clear, then the customer model might need some slight adjustment.  Why adjust only the customer model?  Because the internal model is based on real costs and should be the more accurate number, and the customer model is intentionally simplified and a bit artificial.

Well, hopefully those thoughts make sense.  One cost for keeping internal, and a simpler derived cost for charging on to your customers/users.  This approach should make sense regardless of whether you are providing services to your own business users, or to commercial customers.

Happy modeling, and as always please throw any comments back my way!

Tuesday, June 11, 2013

The Ever-Versatile vCenter Orchestrator


Solving a Simple Backup Problem 

vCenter Orchestrator is a bit of a dark horse in the VMware product portfolio.  Almost every customer has it, because it is licensed along with every vCenter Server, yet almost no customer has ever touched it.  It is actually a VERY powerful tool to have in your toolbox.
Of course, I'm a strong opponent of using orchestration-centric approaches to building a self-service and/or cloud environment.  The two-part problem is that
  1. Orchestration designs, "great" as they might be, need to touch many pieces of technology in the datacenter.  While no single element of integration may be a particular challenge, this up-front implementation of many moving parts is costly and time-consuming - such solutions will typically take 6 months or more, and cost several hundred thousand dollars in services alone.
  2. Orchestration designs are hypersensitive to any technology changes over time. That is, you are likely to break the intricate machine whenever you perform a software upgrade, firmware upgrade or hardware model change. That is usually guaranteed to happen yearly (or half-yearly) for software, and every few years for hardware. Multiplied by 10-15 moving parts, or more, that means the solution is never actually stable for any length of time – or else it holds the business back from making necessary changes.
Well, having said all that, orchestration has its place. If it was on the food pyramid, it might be “fats and oils”. Rich in energy, necessary as part of a complete diet, but you have to take it easy or else it’ll lead to heart attack, or perhaps an inability to leave your front door!  OK, so it’s not the best analogy… but hopefully the point sticks.
Last year, a customer was seeking a simple backup method to safeguard the VMs at their remote offices against breakage from software changes, updates or "tinkering".  Each site had a standalone ESXi host managed by a central vCenter Server, local storage, and fair-to-poor WAN links.  We decided to explore using vCenter Orchestrator to create application-consistent, on-site, self-managing backups.  What we ended up with looked pretty useful, so I thought I would share it here.  It also took Peter Marfatia and me only a few days to put together, which I thought was pretty reasonable for a team with limited skills in the tool.

I have included a link to the resulting package at the bottom of this article, and also the automatically generated documentation that vCenter Orchestrator provided for me.

To get started with vCenter Orchestrator, there are some great resources out there – some I’ve included at the bottom of this article.  It is a great learning experience to install the Orchestrator Appliance and Client, and just look around at the various actions, workflows and tools that it makes available.

The Overall Backup Process

Perhaps to start with, a view of the overall process we used for this Branch Backup would help. 
  1. The workflow is pointed to a folder within vCenter
  2. It discovers all virtual machines within that folder, and determines whether they are candidates for backup, or are instances of prior backups.
  3. If requiring backup, it performs a snapshot with quiescing, then clones the snapshot to a new VM, which is converted to a template (to prevent accidental power-ons).
  4. If looking at prior backups, removes those no longer needed.
  5. When all backups have been processed, the workflow emails a report to a nominated address.

Backup Dispatcher

This is the main entry point into the whole workflow. 


You can see that in vCenter Orchestrator, there is a visual layout of the workflow steps, like all other orchestration tools.  Even if this is the first time you’ve seen orchestrator, you can look at the diagrams and have a fair understanding of what is happening when the workflow is run.

When running the job manually, the workflow asks a small number of questions as shown below.  This is the default user interface presented by the Orchestrator Client, and others are available – check out the VMware Labs site for some options.

Invoking the “Backup Dispatcher” workflow, this interface is asking for the following elements.
  • Email address to send the job report – listing the VMs backed up and the success/failure results.
  • Number of backups to retain – on a per VM basis.  This could be made a global property, but we had fun playing around with different levels here.  I wrote the retention logic to allow for changes to retention, so that during periods of greater change, more backups could be safely kept, and then scaled down later.
  • Folder containing the VMs to be backed up – the workflow would collect ALL VMs from the selected folder.  The management of what to backup is then a simple drag’n’drop of any VMs into or out of this folder.
We provide a few things in the static properties, such as mail server and content settings, but most other things are dynamic.  The static properties are present as “Workflow Attributes” – these are essentially working (read/write) variables that don’t act as workflow input (read only) or output (write only).  Before running the package, you will need to have your vCenter Server registered in vCenter Orchestrator, so that it appears within the vCO inventory and enables communication between the two.

Below is a screenshot of the inventory, as viewed from the vCO Client.  This is one of the excellent aspects of vCO – I can pre-configure what things are present in my environment and not have to deal with connection strings, user credentials, and various properties embedded in scripts.  It is done just once, and then the workflows can talk to your datacenter!  If you can’t see your vCenter inventory like this in vCenter Orchestrator, the workflow won’t be able to do much with your environment!

In my case, I have easily connected up:
  • 2 x vCenter servers
  • UCS Manager
  • Active Directory
  • vCloud Director
  • vCenter Chargeback
  • A mail server

When clicking on the hyperlink to point the workflow to a specific folder (initially has the value “Not set”), the workflow interface presents a view of the vCenter inventory (as seen by vCO) for you to choose from, as in the screenshot below.  Again, this is simpler for the workflow user, because I had already made vCenter available to vCO using a service account (although I could have forced a per-user connection if I wanted).

When scheduling the workflow to run on an automatic daily cycle, you can set this parameter during the scheduling process, in the same way as shown here.  In the case of our customer, they wanted to schedule a collection of these backup jobs daily, each pointing to slightly different sets of VMs to backup.  In the vCO Scheduler, each job entry was provided with the distinct folders that it would manage, and the jobs just ran thereafter without any real caretaking.

The first action of the Backup Dispatcher is to “Get All Virtual Machines By Folder Including Sub Folders”.  This is pretty self-explanatory, and was an action already available through the vSphere Plugin shipped with vCO.  Conveniently, this requires only the input folder, and returns an array containing all VMs found.

The next action is to “Sort VMs By Name”.  I implemented this as a subordinate workflow, while I was toying around with various ways to solve a key problem – which was how to determine whether the current backup is to be retained or not.  I wanted to ensure the workflow didn’t use any hard dates – as it doesn’t know if it is being run weekly or daily, or ad hoc, and a whole bunch of other “if’s” and “maybe’s” that came up while I was thinking about it.  Due to the limited amount of time I wanted to spend on it, and my rudimentary skills, I decided to name VM backups according to a certain naming pattern, which contains:
  • Original VM name
  • A known delimiter, which hopefully won't pop up in a normal VM name.  I chose a colon ":" in my example, after checking with the customer that they wouldn't expect a problem.
  • The keyword "BACKUP"
  • Another delimiter “:”
  • A date/time stamp, in the format yyyyMMddhhmm, such as 201306051609 – which is what my clock says as I write this.
This "Sort VMs By Name" action merely contains a Javascript function, made with a little help from web searching, which sorts VMs according to original name first, and then from most recent to least recent backup.  This helps later on, because the retention policy can skip over the ones to be retained, and delete any subsequent backups older than those.  You'll see that later in the "Process backup and retention logic" workflow.  Anyway, this particular step probably took me the longest amount of time, as I struggled to remember anything at all about writing script.  It just goes to show how little I needed to know for the rest of the effort!
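
For the curious, a comparator along these lines does the job.  This is a sketch of the idea rather than the exact function I ended up with - it assumes the array of VMs exposes each name via a name property, in the "OriginalVMname:BACKUP:yyyyMMddhhmm" pattern described above.

    // Sort by original VM name ascending, then by backup timestamp descending
    // (most recent backup first).  A plain VM (no ":BACKUP:" suffix) sorts
    // ahead of its backups because its missing timestamp is treated as "newest".
    vms.sort(function (a, b) {
        var aParts = a.name.split(":");
        var bParts = b.name.split(":");
        if (aParts[0] !== bParts[0]) {
            return aParts[0] < bParts[0] ? -1 : 1;          // original name, ascending
        }
        var aStamp = (aParts.length > 2) ? aParts[2] : "999999999999";   // plain VM first
        var bStamp = (bParts.length > 2) ? bParts[2] : "999999999999";
        return bStamp.localeCompare(aStamp);                // timestamp, descending
    });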

The next stage of the main workflow is essentially a “for each” loop.  I implemented it as an explicit loop, just because the easy vCO “ForEach” logic control worked a little differently than I wanted it to here – but I could probably tackle this again in a better way.  For each of the VMs in the (now sorted) array, I submit them each to the “Process backup and retention logic” subordinate workflow.  This subordinate workflow will determine whether or not to backup the VM at hand, and if so, will return an identifier for the backup activity.

Once all VMs have been “processed”, the main workflow then waits for any backup activities that are being performed, using the workflow identifiers kept from each job submission, and then sends a report.  The main framework being used here was derived from some previous examples created by Joerg Lew. (Well, I think it was Joerg!)

OK, so that’s the main flow, but the cool bit is doing the live backup from a quiesced snapshot, so let’s look at that - in a minute.  First, we need to figure out what needs backing up.

Process Backup and Retention Logic

In hindsight, this is a crappy name, but this subordinate workflow is being called for each and every VM that was discovered, and is trying to determine whether this is a ‘real’ virtual machine needing backup, or if it’s a backup that might need to be removed.  This part of the workflow was where I did most of the thinking, trying different approaches that I could make work using my rudimentary skills.



You can also see the passing of inputs and outputs for this workflow, which is an awesome visualisation of where information is flowing.  It's also very easy to just click'n'drag this info around, as you're building the workflow.



Firstly, the workflow tries to separate the discovered VM name into its three separate elements, according to the "OriginalVMname:BACKUP:201306051409" type of format.  If this is not actually a backup, the last two elements will just be empty, of course.  If we find that the current VM is a new name, then any counters - which were being used to count the number of old backups - need to be reset to zero.  Then, if the VM has the special keyword "BACKUP" in it, the workflow only needs to determine whether to keep it.

These decisions were based on a couple of simple bits of Javascript logic – but you may notice that all the decisions are being made with vCO logic elements.  This is also another easy part of vCenter Orchestrator – you drop in an “IF” logic box, give it an input to determine a “true” or “false” choice, and then you simply drag a connection for each choice to the next part of the workflow.  Too easy.
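
The name-splitting itself is only a couple of lines of Javascript.  As a rough sketch (the variable names are mine, and I'm assuming the colon delimiter and BACKUP keyword described earlier):

    // Split "OriginalVMname:BACKUP:201306051409" into its three elements
    var parts = vmName.split(":");
    var originalName = parts[0];
    var isBackup = (parts.length >= 3) && (parts[1] === "BACKUP");
    var backupStamp = isBackup ? parts[2] : "";

    if (!isBackup) {
        // A 'real' VM: reset the per-VM backup counter and submit it for backup
    } else {
        // An existing backup: count it, and delete it once it exceeds the retention number
    }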

The only smart bits I used in this whole workflow were:

  • A few Javascript bits of logic, which I could likely replace with more readable vCO logic elements
  • A call to a vCenter action to “Delete Virtual Machine”, if the workflow has found an old backup requiring deletion.
  • A function call to submit a new vCO workflow for any VMs found that need backing up.  This is done in the “Backup This VM” script action, and this is what returns the identifier for the running backup job that is tracked later on.  The workflow that we actually call is “Clone VM For Backup”, which is described next.

Clone VM For Backup

This is a pretty straightforward bit of work, which anybody could put together on their Day 2 exploration of vCenter Orchestrator.  It simply takes a VM as an input, and calls the workflows already available in the vSphere Plugin.


There is one element that is a little 'special' here, which is the "Clone From Snapshot" workflow.  After creating a snapshot, which is passed the "Quiesce=True" parameter, the next piece is to clone the still-running VM from that snapshot.  This workflow came from Joerg Lew's years-old blog on this topic.  This is a native capability of vSphere and the vSphere API, but it just isn't readily exposed through other means such as the vSphere Client.

This workflow is also passed the parameter to make the new clone into a template, which helps avoid accidental power-on operations.  The new template is named according to the "OriginalVMname:BACKUP:yyyyMMddhhmm" format mentioned earlier.
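
Building that timestamp is just a bit of zero-padding in Javascript.  A quick sketch of the kind of thing I mean (the helper name and the vm variable are mine for illustration):

    // Format the current date/time as yyyyMMddhhmm, e.g. 201306051609
    function backupStamp() {
        function pad(n) { return (n < 10 ? "0" : "") + n; }
        var d = new Date();
        return d.getFullYear().toString()
             + pad(d.getMonth() + 1)
             + pad(d.getDate())
             + pad(d.getHours())
             + pad(d.getMinutes());
    }

    // vm here stands for the virtual machine being cloned
    var backupName = vm.name + ":BACKUP:" + backupStamp();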

The quiescing behaviour called during the snapshot action is the native vSphere capability to invoke VSS for Windows machines, or look for scripting stubs in Linux machines.  It is then up to an application owner to determine if any special actions might be needed to ensure application consistency.  Whatever the owner decides, the workflow doesn’t need to worry about it.

The other "cool" thing I decided would make sense is to "Remove All Snapshots" once the clone has finished.  This was a deliberate decision, which I thought would have the added benefit of ensuring snapshots disappeared on a regular basis.  I have fielded enough urgent calls from customers who have killed their environment because of snapshots filling up datastores. If this was deemed undesirable, however, the workflow could be modified to only remove the snapshot that was created in the prior step, using the available "Remove Snapshot" workflow instead.  The risk there is that something unexpected might prevent this clean-up from happening one day, and the snapshot would be effectively "forgotten".  Hence, my decision to remove all snapshots provides a nice safeguard.

Results

At the end of the process, I can look at the results of running this workflow in the environment.  Below is a view of the job mid-flight in a demo environment.

You can just see that the "Moodle" VM is still being cloned and hasn't yet been turned into a template.  You can, however, see the other templates from earlier backups that have finished - from the current scheduled run, and also from an earlier one.

I put a certain amount of logging into the vCO workflows, using the very handy Server.log() function call, such as in the example below.
            Server.log("Submitting backup for: " + vm.name);

I have included the logging output below, to give you an idea of what this creates.

This solution took only a few days of playing, experimenting and learning.  A lot of the vCenter Orchestrator functionality is self-evident, or if not, it is very comprehensively documented.  The customer was very pleased with this simple approach to solving a simple problem, and we trod a fine line between simplicity and complexity, to ensure the customer could easily understand the results and own it without too much hassle.

I am certainly not suggesting that this is a great backup strategy for your organisation, and that isn’t really the point of sharing it here.  I have used this example of a quick and cheap solution to illuminate one way we have used vCenter Orchestrator.  There are many other use cases that I’m sure you will find, once you discover how excellent this tool is, that you probably already possess.

As I pointed out at the start of this article, technical architects can get carried away with orchestration, and many organisations build very complex systems using this approach.  The temptation is certainly there.  However, it is very sensitive to change.  The simple example here might be robust enough, because it is only talking to one element – vSphere.  But this would quickly become an unmanageable beast if we connected to a server platform, a storage platform, a network manager and a firewall system – just for example.  Each element either becomes frozen in time, or else creates a risk of breaking the orchestration workflows.

The abstraction delivered by virtualization solutions such as vCloud Director, and vCenter itself, introduces standardised, software-based interfaces to the datacenter.  Actions can then be controlled through these software interfaces by the tools’ native functions and policies.  This is the true value of the broader Software Defined Datacenter architecture.  For the large, complex enterprise, orchestration is still useful and necessary “glue” from time to time, and vCenter Orchestrator is a very powerful and friendly tool in this capacity.

Further Areas For Expansion

Thanks to Joerg Lew and Peter Marfatia for their contributions to putting this little solution together.  I also greatly appreciate the community leadership provided by Christophe Decanini and Burke Azbill, who contributed plenty of knowledge and examples on the web for me to follow.

The example given here is a quick run at a solution, and certainly has plenty of opportunities for improvement.  With additional time, I would probably replace some Javascript functions with vCenter Orchestrator logic elements, which would make the workflow easier to understand visually, and make the self-documentation more complete.  I would also re-visit the explicit loop I have used here, and find an elegant way to make use of the “ForEach” construct instead.

Resources

There are a bunch of resources that I have used over time, and that really help with getting an introduction to vCenter Orchestrator.  A couple of them are listed below, to start you on your way.


I have also uploaded my vCenter Orchestrator package and documentation at the links below.  Please feel free to use and abuse - and if you make it bigger and better, please share!

Thanks for reading!