Overview

Backup and restore is a core building block of any organisation’s disaster recovery policy. Containerised workloads are no longer short-lived workloads that only run in development environments; they have become almost the standard for running production applications, so it is now crucial to design and implement a solid backup and restore solution for Kubernetes.

The challenge is that traditional backup and restore tools are built to handle physical and virtual machines, and have no mechanisms to deal with Kubernetes, containers, persistent volumes and the other building blocks of modern containerised workloads. In this blog post I am going to discuss how organisations can implement and use Velero as a backup/restore solution for their Tanzu Kubernetes workloads, and I will try to summarise the different options available (along with their differences) when using Velero.

What is Velero?

Velero (previously known as Heptio Ark) is an open-source project providing organisations with tools to back up and restore Kubernetes cluster resources and persistent volumes. Velero can be used with cloud providers offering a Kubernetes platform for your organisation or can be implemented as part of an on-prem deployment. Some Velero use cases are:

  • Take backups of your cluster and restore in case of loss.
  • Migrate cluster resources to other clusters.
  • Replicate your production cluster to development and testing clusters.

Velero consists of a server component, which runs as a deployment in your Kubernetes cluster, and a command line tool, which is used to interact with the Velero server APIs.

Velero and VMware Tanzu

Velero can be used to back up VMware Tanzu Kubernetes Grid Service guest clusters (TKGS) as well as Tanzu Kubernetes Grid workload clusters (TKGm); however, there are major differences in how backup and restore work in each case. Below I list some of the major differences and key takeaways if you decide to use Velero to back up your Tanzu clusters.

Velero in Tanzu Kubernetes Grid Services (vSphere with Tanzu)

Customers who are implementing Tanzu Kubernetes Grid Service workloads (aka vSphere with Tanzu/Workload Management) can use Velero to back up their Tanzu workloads using one of the following methods:

Backup and Restore TKG Cluster Workloads on Supervisor Using Velero Plugin for vSphere

This method requires customers to have workload management (the Supervisor cluster) enabled on top of NSX networking, since in this mode the Velero pods are deployed as vSphere Pods, which are only possible when vSphere with Tanzu is enabled on top of NSX networking and NOT VDS networking.

The main advantage of using the Velero Plugin for vSphere is that it can leverage vSphere snapshots for the persistent volumes present in your workload clusters. For each volume snapshot, a Snapshot custom resource is created in the same namespace as the persistent volume claim (PVC) that is snapshotted.
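
You can inspect those Snapshot custom resources with kubectl after a backup runs. A minimal sketch, assuming the plugin’s backup-driver CRDs are installed and the PVCs live in a hypothetical namespace called microservices:

kubectl get snapshots.backupdriver.cnsdp.vsphere.vmware.com -n microservices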

The gotchas in this scenario are:

  1. Your Supervisor cluster and vSphere Namespaces need to be running on NSX networking, which not all Tanzu deployments may be using. 
  2. Your restore operation must target a Tanzu cluster that is also running on NSX networking, i.e. one that has the Velero Plugin for vSphere installed.

If the above conditions are not satisfied in your Tanzu deployment you need to look into the following option.

Backup and Restore TKG Cluster Workloads on Supervisor Using Standalone Velero

Customers who do not have NSX as the networking provider for their Tanzu clusters can still leverage Velero to back up and restore Tanzu workloads using the open-source restic backup tool. When you deploy Velero and instruct it to use restic as the backup module, Velero performs file system backups (FSB) of the persistent volumes present in the cluster/namespace that you want to back up.

As with everything in life, file system backup brings advantages but also disadvantages. Let’s start with the advantages of using standalone Velero and restic file system backups to back up your Tanzu Kubernetes Grid clusters:

  • Portability of your backups and restores, which means you can restore your backups to any Kubernetes platform running Velero.
  • File system backup is capable of backing up and restoring any type of Kubernetes volume, which means if your volume does not have a native snapshot mechanism then FSB is the solution for you.
  • It is not tied to a specific storage platform, so backups can be stored on a different storage type than the one backing the Kubernetes volumes.

As for the disadvantages, you might need to consider the following:

  • Standalone Velero cannot back up Supervisor VMs.
  • Restic FSB backs up data from the live file system with no quiescing mechanism, so the backed-up data is less consistent than with the snapshot approach.
  • FSB requires root access to the mounted hostPath directory, so the restic pods need to run as the root user, which may not be allowed in some environments.
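
One usage note on restic FSB in this Velero version: by default, volumes are opted in to file system backup by annotating the pods that mount them. A minimal sketch, using a hypothetical pod and volume name:

kubectl -n microservices annotate pod/frontend backup.velero.io/backup-volumes=data-volume

Alternatively, the --default-volumes-to-restic flag switches this behaviour to opt-out, backing up all pod volumes unless explicitly excluded.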

Backup and Restore TKGm Cluster Workloads Using Standalone Velero

For Tanzu Kubernetes Grid clusters (management and workload clusters) the only option is to leverage standalone Velero with restic to perform backup and restore operations; in this scenario, however, the Tanzu management cluster can also be backed up using Velero.

Lab Inventory

For software versions I used the following:

    • VMware ESXi 8.0U1
    • vCenter server version 8.0U1
    • Velero 1.9.7
    • TrueNAS 12.0-U7 used to provision NFS data stores to ESXi hosts.
    • VyOS 1.3 used as lab backbone router and DHCP server.
    • Windows Server 2019 as DNS server.
    • Windows 10 pro as management host for UI access.

For virtual hosts and appliances sizing I used the following specs:

    • 6 x ESXi hosts each with 12 vCPUs, 2 x NICs and 128 GB RAM.
    • vCenter server appliance with 2 vCPU and 24 GB RAM.

Deployment Workflow

  • Deploy an object store as the Velero backup location.
  • Install the Velero CLI and deploy Velero resources in a TKGS guest cluster.
  • Perform and verify backup operation for a namespace.
  • Simulate a DR scenario and verify restored data.

Deploy an Object Store for Velero Backups

Velero requires an S3-like object store to store backups in. This does not mean that you need to use AWS S3; any S3-compatible object store will do. In my setup I used MinIO, which provides an S3-compatible API. You can deploy MinIO using binaries directly on a server/local machine or as a container image in Docker or Podman; details of how to install MinIO can be found HERE

In my lab I deployed MinIO as a Docker container using the following Docker Compose file (you will need to have docker and docker-compose installed prior to running it):

version: '3'

services:
  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_storage:/data # MinIO stores its data under /data (see "command" below); mounting the named volume there makes backups persistent
    environment:
      MINIO_ROOT_USER: nsxbaas
      MINIO_ROOT_PASSWORD: password_of_your_choice
    command: server --console-address ":9001" /data

volumes:
  minio_storage: {}
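
With the compose file saved (e.g. as docker-compose.yml), bringing MinIO up is a one-liner; a minimal sketch, assuming docker and docker-compose are already installed:

docker-compose up -d
docker ps   # verify the minio container is running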

Once the container was up, I was able to access the MinIO UI using the above credentials.

After that you will need to create a read/write bucket (in my setup I created a bucket called tanzu).
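
If you prefer the command line over the UI, the bucket can also be created with the MinIO client (mc); a sketch assuming the endpoint and root credentials from the compose file above:

mc alias set minio http://localhost:9000 nsxbaas password_of_your_choice
mc mb minio/tanzu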

You will also need to generate access keys, which Velero will use to authenticate to MinIO to perform backup and restore operations. You can create your access ID/key pair from the Access Keys option in the left pane under User.

At this point, MinIO is ready to be used as our object store.

Install Velero CLI & Deploy Velero in TKGS Guest Cluster

Step 1: Download the Velero CLI

Download Velero from the Tanzu Kubernetes Grid product download page on the VMware Customer Connect portal. You then need to upload the gzip archive to a Linux machine (in my setup I use a Linux jumpbox) and follow the instructions in the VMware documentation to extract and copy the Velero CLI to your local PATH.
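
The extract-and-install steps boil down to something like the following; a minimal sketch in which the archive name is an assumption and should be adjusted to the file you actually downloaded:

gunzip velero-linux-amd64-v1.9.7+vmware.1.gz          # hypothetical archive name
chmod +x velero-linux-amd64-v1.9.7+vmware.1
sudo mv velero-linux-amd64-v1.9.7+vmware.1 /usr/local/bin/velero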

If the Velero CLI is installed correctly you should see output similar to the below.
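
The check itself is just the version command; the --client-only flag skips querying the server component, which is not deployed yet:

velero version --client-only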

Step 2: Deploy Velero in your Guest Cluster

For this step, you need to create a file called credentials-minio (the name must match what is passed to --secret-file in the install command below) and paste into it the S3 access ID/key pair we generated earlier in the MinIO UI:

[default]
aws_access_key_id =  1K7OupGVfoaWYfmHs23i
aws_secret_access_key =  by4mECTMyMRei1yukHjPUfXcuChINYNp80tlHVn3

At this point we are ready to start deploying Velero in our TKGS guest cluster. Please note that the command syntax below is valid up to Velero version 1.9.7; starting from Velero 1.10.x the syntax has changed, especially for the restic part of the command, so refer to the open-source Velero documentation for the most recent command syntax.

To start the Velero deployment, log in to your TKGS guest cluster and run the following command after modifying the URL address to match your environment:

velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket tanzu \
--secret-file ./credentials-minio \
--use-volume-snapshots=false \
--use-restic \
--backup-location-config \
region=minio,s3ForcePathStyle="true",s3Url=http://services-linux.nsxbaas.homelab:9000,publicUrl=http://services-linux.nsxbaas.homelab:9000

This will kick-start the Velero deployment and you should see output similar to the below.

As you can see, the Velero and restic pods are all in the Running state (ignore the one in Pending; this is due to insufficient resources in my lab and has no effect on backup or restore operations).
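
You can verify this yourself with kubectl; by default velero install places everything in a namespace called velero:

kubectl get pods -n velero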

Perform Backup & Restore Operations for a Namespace

Before we go straight to the command line, it may be interesting to learn a bit about how Velero performs a backup operation for a namespace or a cluster. Velero can perform on-demand and/or scheduled backups of your Kubernetes resources; the workflow is the same for both and is triggered either by the velero backup command (on-demand) or by a pre-configured schedule (an example follows the workflow below). In both cases the workflow can be summarised as follows:

  1. The Velero CLI backup command makes an API call to the Kubernetes API server to create a backup object.

  2. The Velero backup controller notices the new backup object and performs validation.

  3. Velero then begins the backup process. It collects the data to back up by querying the API server for resources.

  4. The Velero backup controller makes a call to the object store (MinIO in our case) to upload the backup file.
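
For completeness, a scheduled backup is created with the velero schedule command using a cron expression; a minimal sketch that would back up the same namespace daily at 01:00:

velero schedule create microservices-daily --schedule "0 1 * * *" --include-namespaces microservices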

Step 1: Perform Back-up Operation for microservices Namespace

I will perform a backup operation for a namespace called microservices using the following command:

velero backup create velero-backup01 --include-namespaces microservices

If the backup operation is successful, you should see output similar to the below when you use the command velero backup describe <backup job name>.

From the above output you can see that the backup job has been completed successfully.
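
You can also cross-check on the MinIO side. With the mc alias created earlier, Velero stores backup contents under the backups/ prefix in the bucket by default:

mc ls minio/tanzu/backups/velero-backup01/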

Step 2: Delete microservices Namespace and restore it from backup

To simulate a DR scenario, I will delete the microservices namespace and then use Velero to restore it from the backup.
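
The delete-and-restore sequence looks like the following, using the backup created in the previous step:

kubectl delete namespace microservices
velero restore create --from-backup velero-backup01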

In the screenshot above, notice that the AGE of the microservices namespace is 88 days; after I perform the restore operation it shows an AGE in seconds, since I deleted the namespace and restored it from backup, as also shown in the screenshot.

The above output was taken after the restore operation, where you can see that the microservices namespace was recreated 87 seconds ago and that the pods/containers are being created and deployed.
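
To confirm the restore from the command line, you can list restore jobs and watch the namespace come back:

velero restore get
kubectl -n microservices get pods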

Hope you have found this blog post useful!