Systems Administration – Spooky Solutions

Most of you who come here know that I post A LOT about the #vCommunity. What you may not know is that I actually have a day job. Who would have thought? In between being a dad, and a husband, and a VMUG Leader in NYC, I’m also a Solutions Engineer for Zerto. I’ve been in this role for over a year and I love it. I really enjoy speaking to customers and learning different and innovative ways of doing things.

As part of this totally awesome gig, I get to talk to a lot of customers and prospective customers about their disaster recovery (DR) and business continuity (BC) plans and approaches. One of my favorite questions to ask is this:

“How many people went into IT to become a DR admin?”

I usually get crickets. Not because it’s a bad thing to be in BC/DR (I make a great living from it), but because it’s not a sexy job. It’s usually a task that gets dumped in your lap for legal or compliance reasons and it takes you away from the things that you WANT to do. It usually involves getting a whole bunch of different teams (Virtualization, Storage, Networking, DBAs, App/Dev, etc.) involved and spending a few weeks (or, more often, a few months) preparing for a test that is almost always done during a (holiday) weekend. Who the hell wants to work on the weekend? I sure don’t. That’s why I made the move to the vendor side, but that’s a whole other story.

There’s one other component that I haven’t mentioned yet. Runbooks. Ugh. Just the thought of those things makes me cringe. Who remembers or still uses those huge loose-leaf binders with hundreds of pages of step-by-step instructions that were written (and probably not updated) years ago? Once a year you would have to dust them off for instructions on how to recover your environment in the event of a disaster. Then you would have to go page by page with a bunch of other team members and hope that the system matches what is on the page.

You know what is really helpful with this kind of situation? The simple acronym RTFM. I come from the military, where this acronym had a very simple meaning:

READ

THE

F*ING

MANUAL

That, however, is the old RTFM.

Since working at Zerto, I’ve come up with a new meaning.

RECOVER

TEST

FAILOVER

MOVE

These are some of the essential functions that any IT Resiliency Platform should be able to provide you. By performing these functions, you’ll ensure that your workloads are protected, your data is intact, and your processes are valid. Let’s take a quick look at each of these functions:

RECOVER

This is the ability to restore your data. It could mean restoring (or, as we like to say, resuming) your VMs or applications. Or it could mean restoring files or folders from a point in time before a disruption.

TEST

Testing is probably one of the most important but also most overlooked operations when it comes to IT Resilience. Testing is how you know with great confidence that your systems will work when you attempt to get them back up and running. It’s a way to recover your VMs or applications in practice before having to do the real thing.

FAILOVER

Failover is a misleading term. This is actually recovering your VMs or applications at the target site. Think of this as initiating your DR plan in a live scenario. If your production site becomes unavailable for whatever reason, this is how you recover your workloads and make your users happy again. Simply put, when you’re down, get yourself back up and running.

MOVE

Zerto has a function called Move VPG which provides you with Application Mobility by migrating a Virtual Protection Group (VPG) to another location. (NOTE: A VPG is made up of the VMs you are protecting.) This could be moving to another storage platform or another datacenter, moving from one hypervisor to another, or even moving to, from, or between cloud providers.

In order to have a complete IT Resilience platform, I believe you need to be able to perform all of these functions simply and consistently. Stay tuned as I dive into each operation a bit more and show how Zerto specifically performs each function.

Stop me if you’ve heard this one before. It’s Monday morning, you open up your newsreader and there are 25 different articles about a newly discovered exploit that is sweeping the net. It affects nearly 90% of systems out there. You know it’s only a matter of time until this news goes from being on the tech sites only to the Wall Street Journal and The New York Times. Once that happens, the alert level hits red. Now all of your C-Level execs are aware of the problem and someone is going to be calling you asking for a status update. If you’re Peter Gibbons, you may even have 8 different people calling you. Where do you go from here? In the old days, this would mean that any plans you had for that weekend were scrapped. You’d now have to coordinate outages with your application teams and IT staff, and sometimes you’d even have to get your building’s security team involved. You’d also have to break the news to your wife, husband, girlfriend, boyfriend, kids, or whoever that you may not see them again until Tuesday (assuming that all goes well). Then you get to go through this scenario:

Planning and executing the downing of all your affected DEV/TEST systems.
Preemptively opening cases with your vendors in case you run into an issue (you would hate to get stuck in the queue without a case number while your systems are down).
Downloading and applying the patches to fix the vulnerability.
Bringing all of said systems back up and running.
Contacting all of your application owners once the systems are back up and having them test all of the applications.
Squeezing in a phone call to your loved ones asking about how life is on the outside.
Notifying all of your users that the systems are back up and running and that now regular weekend work can commence.
Once all of this is done, and you’ve verified that everything is OK and there are no issues, you can now plan to do the same thing to your Production systems. YAY! That usually means another weekend down the toilet.

Many times, some of the pain involved with this type of maintenance can be lessened through mechanisms like vMotion, Exchange DAGs, and clustered systems in general. Typically, you patch each of the secondary nodes in the cluster, then you patch the primary node and you’re good to go. This process of upgrading different cluster nodes can take hours depending on the size of your environment and requires total concentration and focus. If you run into an issue during a failover, you’ll be happy you opened that support case.
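To make that ordering concrete, here is a minimal sketch of the "secondaries first, primary last" pattern in Python. Everything in it is an assumption for illustration: the node names are made up, and the `patch` and `healthy` callables are hypothetical stand-ins for whatever your cluster tooling actually provides. The only real idea here is that you stop at the first node that doesn't come back healthy, so you never take the primary down on top of a broken secondary.

```python
from typing import Callable, List


def rolling_patch(
    secondaries: List[str],
    primary: str,
    patch: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Patch each secondary node, then the primary, one node at a time.

    `patch` and `healthy` are hypothetical hooks supplied by the caller;
    they are not part of any real vendor API.
    """
    patched: List[str] = []
    # Secondaries go first so the primary keeps serving until the end.
    for node in [*secondaries, primary]:
        patch(node)
        if not healthy(node):
            # Stop here: do not touch the remaining nodes.
            raise RuntimeError(
                f"{node} is unhealthy after patching; already patched: {patched}"
            )
        patched.append(node)
    return patched


if __name__ == "__main__":
    # Fake hooks for demonstration only.
    order = rolling_patch(
        secondaries=["node-b", "node-c"],
        primary="node-a",
        patch=lambda node: print(f"patching {node}"),
        healthy=lambda node: True,
    )
    print("patched in order:", order)
```

In real life the health check is the part that eats your hours, which is exactly why that pre-opened support case is worth having.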

Why do I bring all of this up? Traditionally, the one system that usually has the biggest issues during this kind of upgrade/update scenario is your storage environment, especially if you are on legacy storage for one reason or another. In most cases that I have seen, storage code upgrades are completely ignored unless absolutely necessary. I can see why people make that argument. If your storage goes down, especially in a small-to-medium-sized shop, EVERYTHING goes down. This scares the pants off of a lot of people, with good reason. They would rather take the “If it ain’t broke, don’t fix it” approach of yesteryear. Nobody wants to run into those kinds of problems and lose their weekends because of storage issues. This kind of thinking leads to rolling the dice and hoping that the storage environment will just keep on chugging along and that no one will exploit the vulnerabilities that are out there. I think this model is changing in storage though, along the same lines that the break/fix mentality was replaced with a proactive approach. IT departments are getting more sophisticated and are looking to get everything patched and protected BEFORE someone tries to exploit the vulnerabilities.

What if you, the IT engineer, could avoid those sleeping-in-the-office kind of issues and get your weekends back? Who would say no to that? As I’ve written about in the past, I’ve been a customer of Pure Storage for about two and a half years now. I started out on an FA-320 array, I’m currently using their FA-400 series and I’m getting ready to start playing with the FlashArray//m as soon as it arrives. One of the things that sold me on Pure Storage was the Non-Disruptive Upgrade (NDU) capabilities for both the software and the hardware of the array (you can see a demo of their NDU here). I’ve gone through almost every iteration imaginable. I’ve done code upgrades (both minor and major revisions), I’ve added additional shelves of disk, I’ve gone from 300 series to 400 series controllers. You name it, and I’ve probably done it. The one similarity in every upgrade was that it happened like they said it would happen. No downtime, no performance degradation, no idea that it was happening from a user perspective. They were all quick, seamless, and pain free. They also happened during the week (we played it safe and did them on Friday evenings for our Production units), but on Saturday morning I was home playing with my little boy, which is what I care about most.

As I said earlier, this approach appears to be the new status quo. Many other vendors besides Pure Storage are trying to follow suit. EMC has stated that they now support NDUs (although I’m not sure that is the case for different hardware versions). Other vendors such as SolidFire and Nimble also support NDUs. This is a direction that I think everyone in IT welcomes. Being able to provide services quickly to the end user without disturbing their workflow is the goal of nearly every IT staff. This new model greatly increases the success rate of achieving that goal. Pure Storage has gone one step further and changed the typical storage lifecycle model around this principle when they launched Evergreen Storage. The belief is that forklift upgrades will go the way of the dodo bird and you can just replace individual components when needed. Your maintenance never increases (unless you add capacity). Your storage system can stay the same for as long as you need it to, saving you tons of money in the long run while also providing you with a solid foundation to house your infrastructure on.

If other systems start following suit and vendors rethink how we look at system lifecycles, the end result can be great for IT Admins. What if it was as easy to upgrade the code on your core switches and routers as it is to upgrade an app on your iPhone? What if said code could be upgraded FROM your iPhone while you’re sipping margaritas on a beach somewhere (just don’t drink too many until the upgrade is done)? What if upgrading your email servers wasn’t a 6-month project? Whether it’s PC refreshes, server upgrades, or application upgrades, a pain-free process is something everyone would welcome and what we currently strive for as IT pros. It’s nice that we can make end users’ lives easier, but I think it’s time that we make our own lives easier as well. Don’t we as IT admins deserve the same level of happiness and time away from the office as our users do? I sure think so. I think you all would agree with me. It’s nice to see that vendors like Pure Storage share that same vision and are doing something to achieve it.