Stop me if you’ve heard this one before. It’s Monday morning, you open up your newsreader, and there are 25 different articles about a newly discovered exploit that is sweeping the net. It affects nearly 90% of systems out there. You know it’s only a matter of time until this news spreads from the tech sites to the Wall Street Journal and The New York Times. Once that happens, the alert level hits red. Now all of your C-level execs are aware of the problem and someone is going to be calling you asking for a status update. If you’re Peter Gibbons, you may even have 8 different people calling you.
Where do you go from here? In the old days, this meant that any plans you had for that weekend were scrapped. You’d have to coordinate outages with your application teams and IT staff, and sometimes you’d even have to get your building’s security team involved. You’d also have to break the news to your wife, husband, girlfriend, boyfriend, kids, or whoever that you may not see them again until Tuesday (assuming that all goes well). Then you get to go through this scenario:
- Planning and executing the downing of all of your affected DEV/TEST systems.
- Preemptively opening cases with your vendors in case you run into an issue (you would hate to get stuck in the queue without a case number while your systems are down).
- Downloading and applying the patches to fix the vulnerability.
- Bringing all of said systems back up and running.
- Contacting all of your application owners once the systems are back up and having them test all of the applications.
- Squeezing in a phone call to your loved ones asking about how life is on the outside.
- Notifying all of your users that the systems are back up and running and that now regular weekend work can commence.
- Once all of this is done, and you’ve verified that everything is OK and there are no issues, you can now plan to do the same thing to your Production systems. YAY! That usually means another weekend down the toilet.
Many times, some of the pain involved with this type of maintenance can be lessened through mechanisms like vMotion, Exchange DAGs, and clustered systems in general. Typically, you patch each of the secondary nodes in the cluster, then you patch the primary node and you’re good to go. This process of upgrading different cluster nodes can take hours depending on the size of your environment and requires total concentration and focus. If you run into an issue during a failover, you’ll be happy you opened that support case.
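For anyone who likes to see the ordering written down, here is a minimal sketch of that rolling patch sequence: secondaries first, primary last, stopping at the first failure so the untouched nodes keep serving. The `Node` class, the role names, and the `apply_patch` callback are all hypothetical placeholders of mine, not any vendor's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical cluster member; role is "primary" or "secondary".
    name: str
    role: str
    patched: bool = False


def rolling_patch_order(nodes):
    """Return the nodes with all secondaries first and the primary last."""
    secondaries = [n for n in nodes if n.role == "secondary"]
    primaries = [n for n in nodes if n.role == "primary"]
    return secondaries + primaries


def rolling_patch(nodes, apply_patch):
    """Patch one node at a time and return the names that were patched.

    apply_patch is a caller-supplied function (an assumption here) that
    returns True on success. The loop stops at the first failure so the
    remaining nodes stay on the old, working code.
    """
    done = []
    for node in rolling_patch_order(nodes):
        if not apply_patch(node):
            break
        node.patched = True
        done.append(node.name)
    return done
```

In real life the `apply_patch` step is where the failover happens, and it is also the point where you will be glad you already have that support case number.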
Why do I bring all of this up? Traditionally, the one system that has the biggest issues during this kind of upgrade/update scenario is your storage environment, especially if you are on legacy storage for one reason or another. In most cases that I have seen, storage code upgrades are completely ignored unless absolutely necessary. I can see why people make that argument. If your storage goes down, especially in a small to medium-sized shop, EVERYTHING goes down. This scares the pants off of a lot of people, with good reason. They would rather take the “If it ain’t broke, don’t fix it” approach of yesteryear. Nobody wants to run into those kinds of problems and lose their weekends because of storage issues. This kind of thinking leads to rolling the dice and hoping that the storage environment will just keep on chugging along and that no one will exploit the vulnerabilities that are out there. I think this model is changing in storage though, along the same lines that the break/fix mentality was replaced with a proactive approach. IT departments are getting more sophisticated and are looking to get everything patched and protected BEFORE someone tries to exploit the vulnerabilities.
What if you, the IT engineer, could avoid those sleeping-in-the-office kinds of issues and get your weekends back? Who would say no to that? As I’ve written about in the past, I’ve been a customer of Pure Storage for about two and a half years now. I started out on an FA-320 array, I’m currently using their FA-400 series, and I’m getting ready to start playing with the FlashArray//m as soon as it arrives. One of the things that sold me on Pure Storage was the Non-Disruptive Upgrade (NDU) capabilities for both the software and the hardware of the array (you can see a demo of their NDU here). I’ve gone through almost every iteration imaginable. I’ve done code upgrades (both minor and major revisions), I’ve added additional shelves of disk, I’ve gone from 300 series to 400 series controllers; you name it and I’ve probably done it. The one similarity in every upgrade was that it happened like they said it would happen. No downtime, no performance degradation, no idea that it was happening from a user perspective. They were all quick, seamless, and pain free. They also happened during the week (we played it safe and did them on Friday evenings for our Production units), but on Saturday morning I was home playing with my little boy, which is what I care about most.
As I said earlier, this approach appears to be the new status quo. Many other vendors besides Pure Storage are trying to follow suit. EMC has stated that they now support NDUs (although I’m not sure that is the case for different hardware versions). Other vendors such as SolidFire and Nimble also support NDUs. This is a direction that I think everyone in IT welcomes. Being able to provide services quickly to the end user without disturbing their workflow is the goal of nearly every IT staff. This new model greatly increases the chances of achieving that goal. Pure Storage has gone one step further and changed the typical storage lifecycle model around this principle when they launched Evergreen Storage. The belief is that forklift upgrades will go the way of the dodo bird and you can just replace individual components when needed. Your maintenance never increases (unless you add capacity). Your storage system can stay the same for as long as you need it to, saving you tons of money in the long run while also providing you with a solid foundation to house your infrastructure on.
If other systems start following suit and vendors rethink how we look at system lifecycles, the end result can be great for IT admins. What if it was as easy to upgrade the code on your core switches and routers as it is to upgrade an app on your iPhone? What if said code could be upgraded FROM your iPhone while you’re sipping margaritas on a beach somewhere (just don’t drink too many until the upgrade is done)? What if upgrading your email servers wasn’t a 6-month project? Whether it’s PC refreshes, server upgrades, or application upgrades, a pain-free process is something everyone would welcome and what we currently strive for as IT pros. It’s nice that we can make end users’ lives easier, but I think it’s time that we make our own lives easier as well. Don’t we as IT admins deserve the same level of happiness and time away from the office as our users do? I sure think so, and I think you all would agree with me. It’s nice to see that vendors like Pure Storage share that same vision and are doing something to achieve it.