The New RTFM

Most of you who come here know that I post A LOT about the #vCommunity. What you may not know is that I actually have a day job. Who would have thought? In between being a dad, and a husband, and a VMUG Leader in NYC, I’m also a Solutions Engineer for Zerto. I’ve been in this role for over a year and I love it. I really enjoy speaking to customers and learning different and innovative ways of doing things.

As part of this totally awesome gig, I get to talk to a lot of customers and prospective customers about their disaster recovery (DR) and business continuity (BC) plans and approaches. One of my favorite questions to ask is this:

“How many people went into IT to become a DR admin?”

I usually get crickets. Not because it’s a bad thing to be in BC/DR (I make a great living from it), but because it’s not a sexy job. It’s usually a task that gets dumped in your lap for legal or compliance reasons and it takes you away from the things that you WANT to do. It usually involves getting a whole bunch of different teams (Virtualization, Storage, Networking, DBAs, App/Dev, etc.) involved and spending a few weeks (usually a few months) preparing for a test that is almost always done during a (holiday) weekend. Who the hell wants to work on the weekend? I sure don’t, that’s why I made the move to the vendor side, but that’s a whole other story.

There’s one other component that I haven’t mentioned yet. Runbooks. Ugh. Just the thought of those things makes me cringe. Who remembers or still uses those huge loose-leaf binders with hundreds of pages of step-by-step instructions that were written (and probably not updated) years ago? Once a year you would have to dust them off for instructions on how to recover your environment in the event of a disaster. Then you would have to go page by page with a bunch of other team members and hope that the system matched what was on the page.

You know what is really helpful in this kind of situation? The simple acronym RTFM. I come from the military, where this acronym had a very simple meaning:

READ

THE

F*ING

MANUAL

That, however, is the old RTFM.

Since working at Zerto, I’ve come up with a new meaning.

.

.

.

.

.

.

.

RECOVER

TEST

FAILOVER

MOVE

 

These are some of the essential functions that any IT Resiliency Platform should be able to provide you. By performing these functions, you’ll ensure that your workloads are protected, your data is intact, and your processes are valid. Let’s take a quick look at each of these functions:

 

RECOVER

This is the ability to restore your data. It could mean restoring, or as we like to say, resuming your VMs or applications. Or it could mean restoring files or folders from a point in time before a disruption.

 

TEST

Testing is probably one of the most important but also most overlooked operations when it comes to IT Resilience. Testing is how you know with great confidence that your systems will work when you attempt to get them back up and running. It’s a way to recover your VMs or applications in practice before having to do the real thing.

 

FAILOVER

Failover is a misleading term. This is actually recovering your VMs or applications at the target site. Think of this as initiating your DR plan in a live scenario. If your production site becomes unavailable for whatever reason, this is how you recover your workloads and make your users happy again. Simply put, when you’re down, get yourself back up and running.

 

MOVE

Zerto has a function called Move VPG which provides you with Application Mobility by migrating a Virtual Protection Group (VPG) to another location. (NOTE: A VPG is made up of the VMs you are protecting.) This could be moving to another storage platform or another datacenter, moving from one hypervisor to another, or even moving to, from or between cloud providers.

 

In order to have a complete IT Resilience platform, I believe you need to be able to perform all of these functions simply and consistently. Stay tuned as I will dive into each operation a bit more and how Zerto specifically performs each function.

 

(Upgrade) Times are changing…..

Stop me if you’ve heard this one before. It’s Monday morning, you open up your newsreader and there are 25 different articles about an exploit that has been found that is sweeping the net. It affects nearly 90% of systems out there. You know it’s only a matter of time until this news goes from being on the tech sites only to the Wall Street Journal and The New York Times. Once that happens, the alert level hits red. Now all of your C-Level execs are aware of the problem and someone is going to be calling you asking for a status update. If you’re Peter Gibbons, you may even have 8 different people calling you. Where do you go from here? In the old days, this would mean, any plans you had for that weekend were scrapped. You’d now have to coordinate outages with your application teams, IT staff, sometimes you’d even have to get your building’s security team involved. You’d also have to break the news to your wife, husband, girlfriend, boyfriend, kids, or whoever that you may not see them again until Tuesday (assuming that all goes well). Then you get to go through this scenario:
  • Planning and executing the downing of all of your affected DEV/TEST systems.
  • Preemptively opening cases with your vendors in case you run into an issue (you would hate to get stuck in the queue without a case number while your systems are down)
  • Downloading and applying the patches to fix the vulnerability.
  • Bringing all of said systems back up and running.
  • Contacting all of your application owners once the systems are back up and having them test all of the applications.
  • Squeezing in a phone call to your loved ones asking about how life is on the outside.
  • Notifying all of your users that the systems are back up and running and that now regular weekend work can commence.
  • Once all of this is done, and you’ve verified that everything is OK and there are no issues, you can now plan to do the same thing to your Production systems. YAY! That usually means another weekend down the toilet.
Many times, some of the pain involved with this type of maintenance can be lessened through mechanisms like vMotion, Exchange DAGs, and clustered systems in general. Typically, you patch each of the secondary nodes in the cluster, then you patch the primary node and you’re good to go. This process of upgrading different cluster nodes can take hours depending on the size of your environment and requires total concentration and focus. If you run into an issue during a failover, you’ll be happy you opened that support case.
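For anyone who likes to see the order of operations spelled out, here is a minimal Python sketch of that "secondaries first, primary last" pattern. The node names and the patch callable are hypothetical placeholders of mine, not any vendor's tooling; the only point is the ordering and the stop-on-first-failure behavior.

```python
from typing import Callable, List


def rolling_patch(secondaries: List[str], primary: str,
                  patch: Callable[[str], bool]) -> List[str]:
    """Patch every secondary node first, then the primary.

    Stops at the first failure so the remaining nodes keep running the
    known-good code. Returns the nodes that were patched, in order.
    """
    patched = []
    for node in secondaries + [primary]:
        if not patch(node):  # patch() is a stand-in for your real procedure
            break
        patched.append(node)
    return patched


# Hypothetical three-node cluster where every patch succeeds.
print(rolling_patch(["node-b", "node-c"], "node-a", lambda node: True))
```

Stopping at the first failure is a deliberate choice here: it leaves you with at least one untouched node while you work that support case.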
Why do I bring all of this up? Traditionally, the one system that usually has the biggest issues during this kind of upgrade/update scenario is your storage environment. Especially if you are on legacy storage for one reason or another. In most cases that I have seen, storage code upgrades are completely ignored unless absolutely necessary. I can see why people make that argument. If your storage goes down, especially in a small to medium sized shop, EVERYTHING goes down. This scares the pants off of a lot of people, with good reason. They would rather take the “If it ain’t broke, don’t fix it.” approach of yesteryear. Nobody wants to run into those kinds of problems and lose their weekends because of storage issues. This kind of thinking leads to rolling the dice and hoping that the storage environment will just keep on chugging along and that no one will exploit the vulnerabilities that are out there. I think this model is changing in storage though, along the same lines that the break/fix mentality was replaced with a proactive approach. IT departments are getting more sophisticated and are looking to get everything patched and protected BEFORE someone tries to exploit the vulnerabilities.
What if you, the IT engineer, could avoid those sleeping-in-the-office kinds of issues and get your weekends back? Who would say no to that? As I’ve written about in the past, I’ve been a customer of Pure Storage for about two and a half years now. I started out on an FA-320 array, I’m currently using their FA-400 series and I’m getting ready to start playing with the FlashArray//m as soon as it arrives. One of the things that sold me on Pure Storage was the Non-Disruptive Upgrade (NDU) capabilities for both the software and the hardware of the array (you can see a demo of their NDU here). I’ve gone through almost every iteration imaginable. I’ve done code upgrades (both minor and major revisions), I’ve added additional shelves of disk, I’ve gone from 300 series to 400 series controllers, you name it and I’ve probably done it. The one similarity in every upgrade was that it happened like they said it would happen. No downtime, no performance degradation, no idea that it was happening from a user perspective. They were all quick, seamless, and pain free. They also happened during the week (we played it safe and did them on Friday evenings for our Production units) but on Saturday morning I was home playing with my little boy, which is what I care about most.
As I said earlier, this approach appears to be the new status quo. Many other vendors besides Pure Storage are trying to follow suit. EMC has stated that they now support NDUs (although I’m not sure that is the case for different hardware versions). Other vendors such as SolidFire and Nimble also support NDUs. This is a direction that I think everyone in IT welcomes. Being able to provide services quickly to the end user without disturbing their workflow is the goal of nearly every IT staff. This new model greatly increases the success rate of achieving that goal. Pure Storage has gone one step further and changed the typical storage lifecycle model around this principle when they launched Evergreen Storage. The belief is that forklift upgrades will go the way of the dodo bird and you can just replace individual components when needed. Your maintenance never increases (unless you add capacity). Your storage system can stay the same for as long as you need it to, saving you tons of money in the long run while also providing you with a solid foundation to house your infrastructure on.
If other systems start following suit and rethink how we look at system lifecycles, the end result can be great for IT Admins. What if it was as easy to upgrade the code on your core switches and routers as it is to upgrade an app on your iPhone? What if said code could be upgraded FROM your iPhone while you’re sipping margaritas on a beach somewhere (just don’t drink too many until the upgrade is done)? What if upgrading your email servers wasn’t a 6 month project? Whether it’s PC refreshes, server upgrades, or application upgrades, a pain-free process is something everyone would welcome and what we currently strive for as IT pros. We already work hard to make end users’ lives easier, and I think it’s time that we make our own lives easier as well. Don’t we as IT admins deserve the same level of happiness and time away from the office as our users do? I sure think so. I think you all would agree with me. It’s nice to see that vendors like Pure Storage share that same vision and are doing something to achieve it.

How do I know which storage array is right for me?

After a ton of positive feedback on my last post (thank you all for that), people wanted to know more. Specifically, how did I come to the decision on what product was right for my environment? Hopefully, this post will help guide you in the right direction and maybe point something out that you didn’t think of previously. I’m going to do my best to generalize this so you can compare and contrast vendors on your own. Every environment is different so you’ll have to tailor these guidelines to your situation. No one is going to know what you need better than YOU! This actually leads me into my first point.

Identify Your Needs
This step is the most important in my opinion but it is often the most overlooked. Why are you looking at new storage in the first place? Are you experiencing a performance problem that you (and/or your current vendor) cannot resolve? Are there limitations with your current setup that are preventing you from providing the necessary services that your customers require? Is there a new project or initiative at your firm that is presenting you with a new set of requirements altogether? An example of this is when your clients request storage replication for DR/BCP purposes where there was no need prior. Or is it a situation where your array was installed while Saved By The Bell was still on the air and it’s just time for you to find out what the latest and greatest product is and how fast you can get it installed in your environment? Also, do you need Fibre Channel, iSCSI, direct attached or something else entirely? Once you have a clear and concise understanding of what you are looking for and why, the rest of the search is much simpler.

Cost
Unless your name is Tony Stark, Bruce Wayne or Richie Rich, cost is a major factor in any IT purchase. You’re going to have a budget that you need to stick to and you also need to get the most bang for the buck. This is a step that can get very tricky if you don’t have a clear picture of your environment. Obviously, the cost of the array itself and the associated support & maintenance are huge factors in what your overall spend will be. There are other things to consider as well.

  • What does your environment look like now?
  • Are you in a Co-Lo facility?
  • What is your current monthly OPEX spend from a power, cooling and rackspace perspective?
  • What are your power requirements? Does your current array require dedicated circuits to run? What is the additional cost of those circuits?
  • How many rack units and/or full racks does your current setup use? How many do you have available?
  • What is the total $/GB (or $/TB)?
  • Are there additional cost considerations? Will your existing SAN support the additional port density? Will you need to purchase additional networking equipment or cables to support the new requirements?
  • Are there software costs to consider? Do you have to license individual features such as replication, snapshots, etc? Or is it included with the cost of the array?
  • What are the costs for support and maintenance? Do these costs increase substantially over time or will they remain flat for the lifespan of the array? Does maintenance entitle you to any new features or hardware?
  • Better yet, will the solutions that you are looking into DECREASE any of the above mentioned costs? Will you save money on monthly OPEX costs thus lowering the TCO for your solution?

These are some of the things that you need to consider when calculating what your total spend will be. I’ve never met a CxO that likes to be surprised by large increases in their monthly or yearly budget that they didn’t plan for. It usually means a nice conversation with the CFO which never ends well for the CxO and ultimately it doesn’t end well for the person responsible for the increase.
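If it helps to put numbers next to those questions, here is a minimal Python sketch for comparing quotes on $/TB and multi-year total spend. The field names and the dollar figures are made up for illustration, and it assumes support and OPEX stay flat for the life of the array, which is exactly one of the things you need to confirm with each vendor.

```python
from dataclasses import dataclass


@dataclass
class ArrayQuote:
    """One vendor quote. All fields are illustrative assumptions."""
    purchase_price: float         # array hardware and bundled software
    usable_tb: float              # usable (not raw) capacity
    annual_support: float         # assumed flat year over year
    monthly_opex: float           # power, cooling, rack space
    one_time_extras: float = 0.0  # circuits, switch ports, cables, licenses

    def cost_per_tb(self) -> float:
        return self.purchase_price / self.usable_tb

    def tco(self, years: int) -> float:
        """Total spend over the given number of years."""
        return (self.purchase_price + self.one_time_extras
                + self.annual_support * years
                + self.monthly_opex * 12 * years)


# Hypothetical quote: $200k array, 100 TB usable, $20k/yr support,
# $1,500/month OPEX, $10k in extra networking gear.
quote = ArrayQuote(200_000, 100, 20_000, 1_500, 10_000)
print(quote.cost_per_tb())  # 2000.0
print(quote.tco(3))         # 324000.0
```

Run the same numbers for your current array and you can see whether a new solution actually lowers your monthly OPEX or just moves the cost around.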

Performance
Now that you know what your needs are and how much you can spend on your shiny new array, it’s time to get down to business. It has to live up to the hype. You’re going to step in front of your boss in a conference room with a fancy PowerPoint presentation that took you 6 weeks to prepare since you’re a technical person not a PowerPoint guru. You need to justify this exorbitant expense that you are throwing in front of them. The array HAS TO perform. If you are looking at a new array to resolve a performance issue it DEFINITELY needs to perform. You’re going to be looking at All-Flash Architectures, Hybrid arrays, solutions that leverage tiering, server-side solutions, you name it, and I’m sure it will pop up during your search in one way, shape, or form. Once again, the only one who can tell you what is right for you, is you. Make sure you perform baselines before you start looking at solutions so you know what your IOPS, Latency and Bandwidth requirements are. It will help narrow down the possible solutions that suit your needs.
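As a rough idea of what a baseline can boil down to, here is a small Python sketch that summarizes a set of IOPS samples into an average, a 95th percentile and a peak. The sample values are invented and the nearest-rank percentile is just one simple method I picked; use whatever your monitoring tool actually exports. The same summary works for latency or bandwidth samples.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Collapse raw samples into the three numbers worth quoting."""
    return {
        "avg": mean(samples),
        "p95": percentile(samples, 95),
        "peak": max(samples),
    }


# Hypothetical IOPS samples, one per polling interval.
iops = [1000, 1200, 1100, 5000, 1300, 900, 1250, 1150, 1050, 1400]
print(summarize(iops))
```

The reason to look beyond the average is that a single burst (the 5000 above) can be what your users actually complain about, and it disappears in a mean.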

Capacity
Along with performance, question 1A is usually “How much space do I need?”. Seems like a pretty obvious question as well. Along with how much space you need, you should be asking yourself, why so much space? Are you just looking for a performance enhancement but the capacity that you have is more than sufficient? You have 100TB now so you’ll get 100TB on your new array? Are you taking growth into consideration? Is what you’re buying now sufficient to hold you over for the next 3-5 years and beyond? How difficult is it to add new capacity to the array you want to purchase a year from now, 3 years from now or 5 years from now? Can capacity be added non-disruptively? (HUGE POINT in my opinion) What type of storage are you looking at? Are you looking at tiers, all-flash, SAS, SATA? How much of a concern is speed? What type of data will be stored on the array (VMs, Databases, Email, Archive, File)? This is an area you need to be relatively sure of prior to purchase or make management aware that additional capacity may be needed in the future. You don’t want to walk into your CxO’s office 18 months after you buy an array asking for more money because you didn’t buy enough disk. Depending on your CxO, that can turn into a resume generating event.
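To put the growth question into numbers, here is a minimal Python sketch that projects capacity with simple compound growth and tells you how many years a given purchase should last. The 25% annual growth rate and the capacities are assumptions for illustration only; plug in your own history, and remember it ignores data reduction such as deduplication.

```python
def projected_capacity_tb(current_tb: float, annual_growth: float,
                          years: int) -> float:
    """Capacity needed after the given number of years, compounding yearly."""
    return current_tb * (1 + annual_growth) ** years


def years_of_headroom(current_tb: float, usable_tb: float,
                      annual_growth: float, max_years: int = 20) -> int:
    """Whole years before projected data outgrows the usable capacity."""
    years = 0
    while (years < max_years and
           projected_capacity_tb(current_tb, annual_growth, years + 1)
           <= usable_tb):
        years += 1
    return years


# Hypothetical: 100 TB today, growing 25% a year, buying 200 TB usable.
print(projected_capacity_tb(100, 0.25, 3))  # 195.3125
print(years_of_headroom(100, 200, 0.25))    # 3
```

In this made-up case, doubling your capacity buys you about three years, which is worth knowing before you tell management it will last five.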

Features
Now that you know how fast your disk needs to be and how much of it you need, it’s time to look at the other factors that you should consider. For me, the first was simplicity. I’ve worked with at least a dozen different arrays. The bottom line is storage is not the easiest area to deal with if you are not a seasoned storage vet. Especially when you get into the hundreds of terabytes and petabytes. Smaller shops usually feel the pain of this a little more than enterprises do. They may have really good Windows & VMware admins but most of the jack-of-all-trades guys learn storage last. Enterprises usually have dedicated storage teams that only do storage day in and day out. Having an array that is easy to configure and more importantly easy to manage should definitely be on your checklist if you are a novice or even if you’re a top tier storage admin. You’ll need your time to manage the legacy environments that are still lingering. The top of your list should also contain Non-Disruptive Upgrades (NDU). We all know what a pain having to schedule downtime for an array can be. You basically have to bring down EVERYTHING and hope it comes back up normally. Wouldn’t it be nice if that went away and you could upgrade your array as easily as you upgrade an app on your iPhone? There are other features that you should look for like Deduplication, Snapshots, Replication and hypervisor compatibility for virtualized shops. VAAI support makes a huge difference in vCenter environments. You’ll also need to figure out how easy it will be to migrate your data. If you’re a VMware user, it should be as simple as a Storage vMotion. Physical hosts can be a little trickier but most vendors will provide guidance and assistance when necessary. A lot of the features that you’ll need will be extremely apparent just from dealing with your current situation. You know what you like and what you don’t, now is your time to fix all of those issues that you’ve hated for years.

Next Steps
Meet with vendors, lots of them. See what you like and dislike from all of them. Try to gauge which solution meets your needs. You should have the knowledge at this point of what you need, what is most important to you and how much you can spend. Try to get the most bang for your buck. One thing to remember is that you are the customer and you have to do what is right for YOUR company. Making a sales person happy is not your job, making your end users and your management happy is. When all is said and done, if you’re still not sure, make like you’re buying a new car. Take it for a test drive. Most vendors can set up Proof of Concept (POC) boxes for you and you can test the array with your own data. Nothing will show you if a solution will work better than slapping a copy of your VMs on the box and going to town on them. Run the reports you normally run, try your backup jobs, run all of your applications at as close to a production load as you can. What you put into your testing will show tenfold when the production array shows up. You’ll now have a familiarity with the array and you’ll have a reasonable expectation of how it will perform. If you took baselines like I suggested earlier, you’ll even have data to compare to. Also, speak to your peers and read up as much as you can. There are plenty of engineers and admins that have gone through this process before you. Don’t try to reinvent the wheel. Use all the help that you can find around you. Hopefully you have done your homework and you’ll be on the right track to storage happiness.

For those of you who are curious, here’s a simple breakdown of what my evaluation looked like. Obviously, I went into much more detail during my search but this proves that you can figure out your needs with just a few bullet points.

Identify Your Needs – Fast performing, small footprint, low power consumption, cut down on FC ports if possible since we’re nearing capacity on our SAN.
Cost – Had to stay within my budget (numbers withheld for confidentiality reasons)
Performance – Must be able to run Tier 1 apps without affecting other apps and servers running on the array.
Capacity – Expected growth was 150% over three years. Looked for double the usable capacity of current system. Must be able to add additional shelves as need arises.
Features – Simplicity, NDU, Deduplication, Snapshots, Replication.
Next Steps – Met with 10-12 vendors, performed 3 POC’s. Found an array that met the majority of my needs and the remaining needs were on their roadmap. We have Loved Our Storage ever since.

Hopefully this guide will help you in your search. I remember the pain that I went through during this process. I’d love to save you from going through the same. The thing to remember is that this is the tip of the iceberg. You still need to install the array and migrate your data. The quicker you can settle on what works for you, the quicker you can get down to the fun stuff. Feel free to reach out with any questions and please leave feedback if you can. Good luck in your search.