Your Mission Critical Applications Deserve Real Backup Validation!

Photo by Pixabay on Pexels.com

Every organisation knows how important data protection is, yet most organisations never test their backups. Why? Because it is a complex task. But if you never test, how do you know you can actually survive a disaster?

Modern Approach for Data Protection

The first step is, of course, to think about data protection in a modern way. Even though most restores are still individual files and folders, your organisation has to be prepared for bigger disasters. What if half of your 500-VM production environment is hit by a ransomware attack? How do you survive that? Restore times with so-called legacy backups might take days, even weeks. Can your business survive without access to those VMs for weeks? Probably not.

Data protection depends on timely backups, fast global search, and rapid recovery. Cohesity DataProtect reduces recovery point objectives to minutes. Its SnapTree technology eliminates chain-based backups and delivers instantaneous application- or file-level recovery from a full catalog of always-ready snapshots, even at full capacity. This approach can dramatically reduce recovery times.

However, even modern data protection is not enough if you don't know that you actually have something to recover. Most modern platforms handle data integrity at the file system level, but there is still no way to know that your backups are fully recoverable without testing them.

From Data Protection to Recovery Testing

Organisations typically approach recovery testing by simply restoring a single virtual machine (or a few of them). This proves you can recover individual VMs, but it doesn't prove you can recover something that actually works. Some backup vendors implement recovery testing, but it is mostly limited to booting VMs or basic uptime checks.

The other way is to restore application setups manually and test them manually. This is very costly because it requires a lot of manual work, and it also introduces several risks. On the other hand, it lets your organisation test real application workflows properly: do you actually get an answer from your three-tier web application, do your database queries return results, and so on. What if you could keep this style of thorough testing but drop the manual labour?

Automating Recovery Testing

Because modern hypervisor platforms are API driven, it is fairly easy to automate things at the VM level. When you add an API-driven data protection platform like Cohesity, you can automate full recovery testing, including quite complex tests. This is a question I hear from most of my service provider customers, and from larger enterprise customers too: how do you automate complex recovery testing? Let's see…

Cohesity Backup Validation Toolkit

To make things simpler, you can download the Cohesity Backup Validation toolkit from here, and with minimal scripting knowledge it is easy to automate the validation process.

After downloading it, it is time to create some config files. Let's start with the environment.json file, which contains connection information for both the Cohesity and VMware vSphere environments. Create the file with this content:

{
        "cohesityCluster": "cohesity-01.organisation.com",
        "cohesityCred": "./cohesity_cred.xml",
        "vmwareServer": "vcenter-01.organisation.com",
        "vmwareResourcePool": "Resources",
        "vmwareCred": "./vmware_cred.xml"
}

After this we need to create the actual config.json file, containing information about each virtual machine we are about to test.

This file also defines the tests per VM, so it is easy to define multiple tests but run only selected ones per VM. The script also lets you attach a VM to the required test network and change its IP address for testing, so you don't need to test with overlapping production IPs or build siloed networking in VMware.

Note that the VMs don't need to be protected by the same protection job, which makes this more scalable, since you probably have different jobs for web frontends and backend databases.

{
    "virtualMachines": [
        {
            "name": "Win2012",
            "guestOS": "Windows",
            "backupJobName": "VM_Job",
            "guestCred": "./guestvm_cred.xml",
            "VmNamePrefix": "0210-",
            "testIp": "10.99.1.222",
            "testNetwork": "VM Network",
            "testSubnet": "24",
            "testGateway": "10.99.1.1",
            "tasks": ["Ping","getWindowsServicesStatus"]
        },
        {
            "name": "mysql",
            "guestOS": "Linux",
            "linuxNetDev": "eth0",
            "backupJobName": "VM_Job",
            "guestCred": "./guestvm_cred_linux.xml",
            "VmNamePrefix": "0310-",
            "testIp": "10.99.1.223",
            "testNetwork": "VM Network",
            "testSubnet": "24",
            "testGateway": "10.99.1.1",
            "tasks": ["Ping","MySQLStatus"]
        }
    ]
}
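
The per-VM task lists in config.json can be dispatched with a simple lookup table. The toolkit itself is PowerShell, but the idea is generic; here is a minimal Python sketch where the test functions are hypothetical placeholders, not the toolkit's real implementations:

```python
# Hypothetical sketch of dispatching per-VM task lists from config.json.
# The real toolkit is PowerShell; ping/mysql_status here are placeholders.
import json

def ping(vm):
    # Placeholder: a real test would ICMP-ping vm["testIp"]
    return f"Ping {vm['testIp']}: OK"

def mysql_status(vm):
    # Placeholder: a real test would query the MySQL service in the guest
    return f"MySQLStatus on {vm['name']}: running"

TASKS = {"Ping": ping, "MySQLStatus": mysql_status}

def run_validation(config_text):
    config = json.loads(config_text)
    results = {}
    for vm in config["virtualMachines"]:
        # Run only the tasks listed for this particular VM
        results[vm["name"]] = [TASKS[t](vm) for t in vm["tasks"] if t in TASKS]
    return results

config_text = """{"virtualMachines": [
  {"name": "mysql", "testIp": "10.99.1.223", "tasks": ["Ping", "MySQLStatus"]}
]}"""
print(run_validation(config_text))
```

The point of the lookup table is that adding a new test is just one new function plus one new entry, with no changes to the loop itself.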

The final step is to create the actual credential files. To avoid keeping usernames and passwords in configuration files in plaintext, we can use a simple PowerShell script to create them. You can have one shared credential file for all VMs, or one per VM. Note that these users must have administrator-level access to the VMs in order to change the IP configuration to the test network.

To create credential files you can use the included createCredentials.ps1 script, which creates only a single guestvm_cred.xml file; if you need more, simply run this PowerShell command:

Get-Credential | Export-Clixml -Path guestvm_more.xml

Since this file is encrypted, it can only be read by the same user who created it, so make sure you create the credential files as the same user that will run the testing scripts.

So How Does it Work?

Here is an example run that clones two virtual machines (one Linux and one Windows) and runs a different set of tests on each VM.

First the script reads the configuration files and connects to the Cohesity cluster and the VMware vSphere vCenter environment. Then it starts the clone process for the VMs.

After the clone process is done, the script moves to the actual validation phase, where it first checks that the clone task succeeded and that the cloned VMware VMs are powered on with VMware Tools running.

When the VMs are up and VMware Tools is running, the script runs a test per VM to ensure it can push scripts through VMware Tools. The next task is to move each VM to the correct VM network and then change its IP configuration.

After moving the VMs to the correct network, the script runs the tests for each VM.

After the tests have run, the clones are automatically cleaned up from the Cohesity and VMware environments.

Notes

This automation toolkit is not officially provided by Cohesity; any bugs can be reported to me directly. The toolkit has some limitations:

You can use it only with VMware environments running vCenter.

You can run tests only against Linux and Windows virtual machines, and Windows machines need PowerShell installed.

I hope this simple toolkit helps you automate your organisation's backup validations!

Building a Modern Data Platform by Exploiting the True Possibilities of Public Cloud

Photo by Pixabay on Pexels.com

Building a modern next-generation datacenter requires a specific approach and an understanding of automation. When designing a modern datacenter, we have to understand that data is the central element of today's business, and that automation not only saves time but dramatically reduces the risk of human error. An on-premises datacenter, even a next-generation one, is still only one element of a data platform. No modern data platform would be complete without the option to use the public cloud; in fact, the public cloud plays a significant role in a modern data platform, providing capabilities we just couldn't get any other way.

In this post we look at the benefits of the public cloud, and at how to overcome the challenges of cloud adoption so that the cloud becomes a functional key element of our platform.

Why does our data need the public cloud?

Modern storage systems are very good, and over the past couple of years they have evolved a lot. Still, a modern data-centric approach and fast-changing business landscapes demand flexibility, scalability, data movement, and a new commercial model, and it quickly becomes clear that the cloud can potentially answer all of these challenges.

While these business challenges are common to pretty much all traditional systems, they are exactly the area where the public cloud is strongest. The cloud can, in theory, scale infinitely, and its consumption model lets organisations move CAPEX investments to OPEX by paying only for what they need, while retaining the flexibility to scale up or down with current business requirements. But the cloud can do much more. We can take a copy of our data and do interesting things with it once it is copied or moved to the public cloud. Organisations typically start with the low-hanging fruit: backups. They are easy to move from on-premises to the cloud, since pretty much every modern backup product supports a cloud extension (if yours doesn't, it may be a very good time to look for something better). Once our backup data is in the public cloud, we can get more out of it: the cold data can feed business analytics or artificial intelligence, and it can also serve as disaster recovery. With proper design, this can be far cheaper than building a disaster recovery site. In the end, flexibility is the most compelling reason for any organisation to consider the public cloud.

But if the benefits are so clear, why do so many organisations fail to realise them by not moving to the cloud?

Why do organisations resist moving to the cloud?

It's not what the public cloud can do, but what it doesn't do, that tends to stop organisations from wholeheartedly embracing it when it comes to their most valuable asset: data.

As we've worked through the different areas of building a modern data platform, our approach to data is about far more than just storage. It covers insight into the data, protection, security, availability, and privacy, and these are not things normally associated with native cloud storage, which is traditionally built to be easily scalable and cheap rather than to meet these needs. Because organisations have grown used to these requirements, they don't want to move their data to the cloud if it means losing those capabilities, or having to implement and learn a new set of tools to deliver them.

Of course there is also the "data gravity" problem: we can't have our cloud-based data siloed away from the rest of our platform; it has to be part of it. We need to be able to move data into the cloud, but also back on-premises and even between cloud providers, while still retaining the key elements enterprise organisations require: control and management.

So is there really a way to overcome these challenges and make the cloud a fundamental part of a modern data platform? Yes, there is.

Making the cloud part of the enterprise data platform

There are dozens and dozens of companies trying to solve this problem. Most of them start from the top without really looking at the underlying issue: data mobility. If you look at the AWS Marketplace storage category, you will see almost 300 different options, so the question is how anyone can know which one really unlocks the full potential of true hybrid cloud. The answer is: without deep knowledge, one really can't. I will not point at any single vendor, but quite a few claim they can give you data mobility and let you use your data to its full potential, while only a few of them really can.

Two things make this very hard.

The first is data movement between on-premises and the cloud. It's easy to copy data from point A to point B, but how do you make it cost-efficient and fast? Moving huge amounts of data takes time even over very fast internet connections, so a built-in capability to move only the needed blocks can make a significant difference, not only in migration times but also in cost: pretty much all cloud vendors charge for egress traffic, so when it is time to move data back on-premises or to another cloud vendor, this can mean a huge difference in the bill.
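
The idea of moving only the needed blocks can be sketched in a few lines: hash fixed-size blocks on both sides and transfer only the blocks whose hashes differ. This is an illustrative Python sketch, not any vendor's actual replication protocol:

```python
# Sketch: block-level change detection cuts replication traffic.
# Hash fixed-size blocks; only blocks whose hash differs need to move.
import hashlib

BLOCK = 4  # tiny block size for the demo; real systems use KB/MB blocks

def block_hashes(data: bytes):
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old: bytes, new: bytes):
    """Return indices of blocks that must be transferred."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or old_h[i] != h]

old = b"AAAABBBBCCCCDDDD"
new = b"AAAAXXXXCCCCDDDD"   # only the second block changed
delta = changed_blocks(old, new)
print(delta)                 # -> [1]
print(f"transfer {len(delta) * BLOCK} bytes instead of {len(new)}")
```

With realistic block sizes the same logic is why incremental replication can move gigabytes instead of terabytes, and why egress charges shrink accordingly.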

The second is the ability to use migrated data for several purposes. Using the cloud as a backup target is quite inefficient if you cannot use the same data as a source for DR, analytics, AI, or test and dev. Cloud storage doesn't cost that much, but if you can use it efficiently for more than one use case, you reduce the total cost considerably.

Both of these are the foundation of enterprise capabilities. And while adding enterprise capabilities is great, the idea of a modern data platform relies on having our data in the location we need it, when we need it, while maintaining management and control. This is where efficient technology provides a real advantage. You can achieve this in many ways; one example is using NetApp's ONTAP storage system as a consistent endpoint, allowing organisations to use the same tools, policies, and procedures at the core of the data platform and extend them to the organisation's data in the public cloud. This is possible when the vendor has a modern software-defined approach.

NetApp's integrated SnapMirror provides the data movement capability, so one can simply move data into, out of, and between clouds. Replicating data this way means that while the on-premises version can be the authoritative copy, it doesn't have to be the only one. Replicating a copy of data to a location for a one-off task, then destroying it once the task completes, is a powerful capability and an important element in simplifying the extension of an organisation's data platform into the cloud.

So does the technology vendor matter?

The short answer is no. You don't need technology vendor X to deliver a true hybrid cloud service. You do not need to use NetApp; I used it as an example because it has nice cloud integration features built in, and thanks to them it can deliver a modern data platform easily, providing consistent data services across multiple locations (on-premises and cloud) while maintaining all critical enterprise controls. Of course, this means you need NetApp both on-premises and in the cloud.

When you evaluate vendor Y for your next-generation datacenter, it is critical to think about how your enterprise data platform will give you the option to expand your business into the cloud. While other data service providers have somewhat similar offerings, I think NetApp's story and capabilities are in line with the requirements of a modern data platform. There are further solutions that achieve something similar, and even go a bit further; I will cover one of them in my next post.

In the end, the most important thing in your design strategy, if it is to include the public cloud, is to ensure that you have appropriate access to data services, integration, control, and data management. It is crucial that you don't put your organisation's most valuable asset, its data, at risk, or diminish the capabilities of your data platform by using the cloud. The cloud will play a huge role in future data platforms, so make sure you have an easy option to move workloads to the cloud – and back.

Primary and Secondary Storage aka Tiering in 2017

History of Hierarchical Storage Management

Having multiple storage tiers is not a new idea; the history of hierarchical storage management (HSM) goes back quite far. It was first implemented by IBM on its mainframe platforms to reduce the cost of data storage and to simplify retrieving data from slower media. The idea was that the user would not need to know where the data was actually stored or how to get it back – with HSM the computer would retrieve the requested data automatically.

Historically, HSM was somewhat buried when the world moved from the 1st platform to the 2nd (the client-server PC era). Quite soon many organizations realised they still had applications that needed centralized storage, and so the storage area network (SAN) was born. Later, when server virtualization exploded the need for high-performance storage, organizations realized that, again, it was too expensive to run all application data on a single high-performance tier, and too difficult to move data between isolated tiers. Tiering, or HSM, was born again.

Many storage vendors implemented some kind of tiering system. Some monitored actual hot blocks and migrated them between slower and faster tiers, often across three tiers (SSD, SAS and SATA). Between these systems, the real differences were typically only the size of the block moved and the frequency of movement. This was the approach of IBM, EMC and HDS, to name a few among many. This approach solved many problems, but in many cases it simply reacted too slowly to changing performance needs. With proper design it works very well.
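
The hot-block style of tiering boils down to counting accesses over a window and promoting the most-accessed blocks. Here is an illustrative Python sketch of that policy (not any specific vendor's algorithm):

```python
# Sketch of access-frequency tiering: count reads per block over a window,
# then promote the hottest blocks to the fast tier and demote the rest.
from collections import Counter

def retier(accesses, fast_tier_size):
    """accesses: iterable of block ids; returns (fast_tier, slow_tier)."""
    counts = Counter(accesses)
    # The N most frequently accessed blocks earn a place on the fast tier
    hot = [b for b, _ in counts.most_common(fast_tier_size)]
    cold = [b for b in counts if b not in hot]
    return sorted(hot), sorted(cold)

accesses = [1, 1, 1, 2, 3, 3, 4, 3, 1]   # block 1 and 3 are hot
fast, slow = retier(accesses, fast_tier_size=2)
print(fast, slow)  # -> [1, 3] [2, 4]
```

The "reacts too slowly" criticism maps directly to how long the counting window is: a long window gives stable placement but lags behind sudden workload shifts.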

Other vendors implemented tiering based on caching. Every storage system has a read and write cache, and these vendors added high-performance disks (SSDs) to extend its size. This reacts very quickly to changes and typically needs no tuning. However, this approach doesn't let you pin application data to a selected tier, so proper design is critical.
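
The cache-extension approach behaves like an ordinary read cache with an eviction policy. An LRU policy is one common choice; this is an illustrative Python sketch under that assumption, not a specific product's cache design:

```python
# Sketch of cache-style tiering: SSDs act as an extended read cache in
# front of slow disk. A simple LRU policy needs no tuning by the admin.
from collections import OrderedDict

class SSDReadCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # block id -> data
        self.hits = self.misses = 0

    def read(self, block_id, backend):
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)       # refresh recency
            return self.cache[block_id]
        self.misses += 1
        data = backend(block_id)                   # slow-tier read
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)         # evict least recent
        return data

disk = lambda b: f"data-{b}"                        # stand-in slow tier
cache = SSDReadCache(capacity=2)
for b in [1, 2, 1, 3, 1]:
    cache.read(b, disk)
print(cache.hits, cache.misses)  # -> 2 3
```

Note what the sketch cannot do: there is no way to pin a block in the cache, which is exactly the limitation the paragraph above describes.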

All flash storage changed the game

In the late 2000s, all-flash storage systems moved very high-performance applications from tiered storage to pure flash. The price of flash was very high, and typically you had one all-flash system per application. A few years later, all-flash systems largely replaced spinning disks and tiered storage in most organizations: flash pricing became affordable, and efficiency technologies (deduplication and compression) meant you could put more data on the same disk capacity, which brought the per-gigabyte price quite close to high-performance spinning disks (10/15k SAS drives).

Suddenly you didn't need tiering, since all-flash systems gave you enough performance to run all your applications. But this introduced isolated silos, and the move from the 2nd platform to the 3rd has meant dramatic growth in data volumes.

Living in the world of constant data growth

Managing unstructured data continues to be a challenge for most organizations. When the Enterprise Strategy Group surveys IT managers about their biggest overall storage challenges, growth and management of unstructured data comes out at or near the top of the list most of the time.

And that challenge isn’t going away. Data growth is accelerating, driven by a number of factors:

The Internet of Things

We now have to deal with sensor data generated by everything. Farmers are putting health sensors on livestock so they can detect issues early on, get treatment and stop illness from spreading. They’re putting sensors in their fields to understand how much fertilizer or water to use, and where. Everything from your refrigerator to your thermostat will be generating actionable data in the not too distant future.

Bigger, much richer files

Those super-slow motion videos we enjoy during sporting events are shot at 1,000 frames per second with 2 MB frames. That means 2 GB of capacity is required for every second of super-slow motion video captured. And it’s not all about media and entertainment; think about industry-specific use cases leveraging some type of imaging, such as healthcare, insurance, construction, gaming and anyone using video surveillance.

More data capture devices

More people are generating more data than ever before. The original Samsung Galaxy S smartphone had a 5-megapixel camera, so each image consumed 1.5 MB of space compressed (JPEG) or 15 MB raw. The latest Samsung smartphone takes 16-megapixel images, consuming 4.8 MB compressed / 48 MB raw, a three-fold increase in only a few years.

The Enterprise Data Lake is the 2017 version of tiered storage

Tiered storage, as it used to be implemented, has been reborn as the next generation of an old idea. In the modern world, tiering solves problems related to massive data growth. More and more production data is going to all-flash arrays, but since only 10-30% of data is really hot, organizations must implement some kind of secondary storage strategy to move cold data from still-expensive primary storage to much cheaper secondary storage.

The secondary storage of today is object-based, to respond to the fast pace of data growth and the data locality problems of IoT. Organizations will use this same object storage platform for their Internet of Things needs, and perhaps also as the place to keep backups of their production application data.

How to back up object storage?

Background

The traditional file server may have outlived its usefulness for an increasing number of organizations in recent years. These file servers were designed in an era when all employees typically sat in one location, remote workers and road warriors were rare, and the only files created were office documents (Word, Excel, PowerPoint, etc.). Times have changed: it is the new normal for organizations to have employees all over the globe, and IoT (Internet of Things) has changed the landscape dramatically. IoT devices generate far more data than human users ever will, and this data must be kept safe, since data is now the main asset of many organizations. Ordinary file servers no longer meet the requirements of the modern data center.

Object storage solves these issues, but first you need to understand what object storage is and how it differs from traditional file servers.

So what is object storage?

For years, data was typically stored in a POSIX file system (or in databases, but let's focus on file services here), where data is organized into volumes, folders/directories, and sub-folders. Data written to a file system has two parts: the data itself and the metadata, which in POSIX file systems is typically simple, covering only creation date, change date, and so on. Object storage works a bit differently.

Object storage also stores data, but unlike a POSIX file system, an object storage system gives each object a unique id rather than a name, managed in a flat index instead of folders. In POSIX file systems, applications access files by directory and filename; in object storage, they access objects by providing the unique object id to fetch or rewrite the information. Because of this flat architecture, object storage provides much greater scalability and faster access to a much higher quantity of files than most traditional POSIX-based file servers. The flat architecture also enables much richer metadata, since most object storage systems allow a much broader set of information per object than a traditional POSIX file system. You can store all the same information (creation date, change date, etc.) but also add things like expiration dates, protection requirements, and application-level hints: what type of file this is, what kind of information it contains, and so on.
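
The flat index plus rich metadata can be shown in a few lines. This is a toy in-memory Python sketch of the concept (real systems distribute the index and persist the data; a content-derived SHA-256 id is just one possible id scheme):

```python
# Sketch of a flat object store: each object gets a unique id (here a
# content-derived SHA-256) in one flat index, plus free-form metadata.
import hashlib

class ObjectStore:
    def __init__(self):
        self.index = {}   # object id -> (data, metadata); no folders at all

    def put(self, data: bytes, **metadata) -> str:
        oid = hashlib.sha256(data).hexdigest()
        self.index[oid] = (data, metadata)
        return oid        # the caller keeps the id instead of a path

    def get(self, oid: str):
        return self.index[oid]

store = ObjectStore()
oid = store.put(b"sensor reading 42",
                sensor="barn-7", expires="2030-01-01")  # rich metadata
data, meta = store.get(oid)
print(meta["sensor"])  # -> barn-7
```

Because lookup is a single flat-index operation rather than a directory walk, the same design scales to billions of objects.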

In general, object storage is designed to handle massive amounts of data, whether stored as large objects like video, images, and audio, or as billions of very small objects containing IoT sensor data. A typical object store can scale very large, and this raises a question: how can we back up the data stored in it?

Why is object storage not that easy to back up?

Object storage is traditionally considered when you have massive amounts of data: petabytes, and/or billions of objects (files). This kind of platform challenges a traditional enterprise backup solution. Think about it for a while: if you need to back up all changed objects, how long will it take an enterprise backup server to work out which objects have changed, fetch them from storage, and put them on disk or tape? At large scale this slowly becomes impossible. You need a different kind of solution.

From backups to data protection

The first step is to change our mindset from backups to data protection. A backup is just one method of protecting your data, isn't it? Once we understand this, we can start thinking about how to protect our data in object storage environments. The traditional way is always available, but let's look at how object storage environments are protected in real life.

Object storage systems tend to have versioning built in. Versioning is simply a method of saving the old version of an object whenever it is changed or deleted. Even when a user deletes an object, it is not actually removed from the system; it is just marked as deleted. Combine this with a multisite deployment and you have a built-in data protection solution capable of protecting your valuable data. However, we also need smart information lifecycle management (ILM) in our object storage, so that deleted objects are actually removed when the ILM rules say so; for example, after an object is "deleted" (marked deleted), we might keep the last version for five years and only then remove it from the system.
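
The interplay of versions, delete markers, and ILM purging can be sketched concretely. This is an illustrative Python model with timestamps simplified to integers, not any product's actual versioning implementation:

```python
# Sketch of versioning-based protection: a delete only adds a marker, and
# an ILM rule later purges versions older than the retention window.
class VersionedObject:
    def __init__(self):
        self.versions = []                # list of (timestamp, data or None)

    def put(self, t, data):
        self.versions.append((t, data))

    def delete(self, t):
        self.versions.append((t, None))   # delete marker; old data kept

    def current(self):
        return self.versions[-1][1] if self.versions else None

    def purge(self, now, retention):
        # ILM rule: drop versions older than the retention window,
        # always keeping the newest entry.
        self.versions = [v for v in self.versions[:-1]
                         if now - v[0] <= retention] + self.versions[-1:]

obj = VersionedObject()
obj.put(t=0, data=b"v1")
obj.put(t=10, data=b"v2")
obj.delete(t=20)                    # user "deletes" the object
print(obj.current())                # -> None (looks deleted)
print(obj.versions[1][1])           # -> b'v2' (still recoverable)
obj.purge(now=2000, retention=365)  # ILM removes expired versions
print(len(obj.versions))            # -> 1 (only the delete marker remains)
```

This is why an accidental or even malicious delete is recoverable right up until the ILM rule, not the user, decides the data is gone.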

This approach has one limitation you must keep in mind regarding data durability. What typical data protection systems do not protect against is bit rot: magnetic degradation over time. Magnetic storage degrades, and it is not a question of whether the data will be corrupted but when. The good news is that an enterprise-ready object storage solution should have methods to detect and repair the effects of this degradation, using a unique identifier for each object produced by some algorithm (e.g. SHA-256) run against the object's contents. By re-running the algorithm against the object and comparing the result with its stored identifier, the software makes sure that no bit rot has corrupted the file.
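
The fixity check described above is simple to demonstrate. This Python sketch shows the mechanism (store a digest, re-hash later, compare); real systems run this continuously in the background and repair from a replica on mismatch:

```python
# Sketch of a fixity check: store a SHA-256 digest with each object and
# re-hash periodically; a mismatch flags silent corruption (bit rot).
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, stored_digest: str) -> bool:
    return fingerprint(data) == stored_digest

payload = b"important object"
digest = fingerprint(payload)                   # recorded at write time
print(verify(payload, digest))                  # -> True: intact copy

rotted = bytes([payload[0] ^ 1]) + payload[1:]  # simulate one flipped bit
print(verify(rotted, digest))                   # -> False: repair needed
```

Even a single flipped bit changes the digest completely, which is exactly why a cryptographic hash works as a corruption detector.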

There is of course one more issue that must be solved a different way: the rogue admin. There is always the possibility that the storage admin simply deletes the whole system. But let's cover that in a future post later this spring!

Backing Up Virtual Machines with VMware Data Recovery

Overview

VMware introduced its Data Recovery product already with the vSphere release, but the serious problems of the first version limited its adoption in many companies. Now VMware has updated Data Recovery, and it is starting to become the product it should have been from the start.

The idea of the product is simple: VMware Data Recovery is an appliance-based backup server. Once deployed, it can take image-level backups of the virtual machines in a vSphere environment, and from these backups you can restore either a whole machine or just individual files (file-level restore requires guest support; currently Windows and, on the Linux side, Red Hat are supported). Data Recovery also includes data deduplication, meaning identical data is stored on the backup server's disks only once. Backups are taken as incrementals, so a backup run copies only changed data, which can shorten the backup window significantly.
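
The deduplication idea mentioned above is worth making concrete: identical blocks are stored once and referenced by their hash. This is a toy Python sketch of block-level deduplication, not VDR's actual on-disk format:

```python
# Sketch of block-level deduplication: identical blocks are stored once
# and referenced by hash, the way a dedup backup target saves space.
import hashlib

class DedupStore:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}      # hash -> block data (each stored only once)
        self.files = {}       # name -> ordered list of block hashes

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)   # duplicates cost nothing
            hashes.append(h)
        self.files[name] = hashes

    def read(self, name):
        return b"".join(self.blocks[h] for h in self.files[name])

store = DedupStore()
store.write("vm1.img", b"AAAABBBBAAAA")   # block AAAA repeats internally
store.write("vm2.img", b"AAAACCCC")       # and is shared across files
print(len(store.blocks))                  # -> 3 unique blocks stored
print(store.read("vm1.img"))              # -> b'AAAABBBBAAAA'
```

Five logical blocks across the two images collapse into three stored blocks, which is the whole point when many VM images share the same operating system data.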

VMware Data Recovery does not support backing up to tape; it is disk-based only. The appliance supports FC, NAS, and local disk, which naturally enables a wide range of use cases; I recommend either a LUN presented over FC or a dedicated NAS system.

Deployment

Deploying VMware Data Recovery is really easy. You need an existing vSphere environment with a suitable license; the product itself is already included in vSphere Essentials.

Before deployment, make sure the following ports are allowed in your network devices:

  • VDR uses vCenter's web services interface, which requires ports 80 and 443 to be open
  • The VDR plug-in on the backup machine and file-level restore (FLR, File Level Restore) require a connection to the VDR appliance on port 22024
  • The VDR appliance connects to VMware ESX or ESXi on port 902

VDR needs permissions both in vCenter and on the virtual machines to be backed up, so you should create a dedicated user account and grant it the required rights. The following permissions are required for every virtual machine to be backed up:

  • VirtualMachine->Configuration->Disk change tracking
  • VirtualMachine->Provisioning->Allow read-only disk access
  • VirtualMachine->Provisioning->Allow VM download
  • VirtualMachine->State->Create snapshot
  • VirtualMachine->State->Remove snapshot

And the following permissions on the VDR appliance itself:

  • Datastore->Allocate space
  • VirtualMachine->Configuration->Add new disk
  • VirtualMachine->Configuration->Change resource
  • VirtualMachine->Configuration->Remove disk
  • VirtualMachine->Configuration->Settings

In addition, the user must have the following permission for the whole environment:

  • Global->License

Once the user accounts have been created, you can move on to installing the environment itself. Download the installation CD image from the VMware site with your credentials. On the vCenter machine, start the installer from the CD and choose Data Recovery Client Plug-In. Follow the installer's instructions; when the installation has finished, connect with the vSphere Client and install the plug-in on your workstation by choosing Plugins -> Manage Plugins in the vSphere Client. Once it has installed, restart the vSphere Client to enable the new plug-in.

Next, install the VDR appliance itself. In the vSphere Client, open File -> Deploy OVF Template. Choose Deploy from File, browse the installation CD to the directory X:\VMwareDataRecovery-ovf (where X is your CD drive letter), and select the file VmwareDataRecovery_OVF10.ovf. Choose the target, cluster, and datastore, select the disk format you want, and set the time zone on the Properties tab. Finally, check that the selections are correct and press Finish, and vSphere will start deploying the VDR server.

Once the VDR server itself is installed, you can add a backup disk to it directly from the storage system using VMware's RDM functionality, where the virtual machine writes directly to the LUN presented to it, offering better performance than a virtual disk.

In the next part I will show how to configure the basic settings on the VDR server and how to set it up to run backups.