During a recent incident I accidentally deleted a Tanzu Kubernetes Cluster which had Antrea CNI integrated with NSX. To my surprise, NSX had no way of detecting that the cluster was gone, and all of its inventory information (nodes, namespaces, pods, etc.) was still visible in the NSX Manager UI under Inventory.

This actually makes sense: the integration between Antrea and the NSX Manager is handled by the Antrea interworking pods, which are deployed during the Antrea/NSX integration (I covered this in my blog post HERE). In other words, NSX does not actively pull any state information from a cluster running Antrea, and the proper way to remove the integration would have been to deploy the Antrea inventory cleanup manifest, which is part of the Antrea deployment manifests you download from VMware Customer Connect. With the cluster already gone, however, there is nothing left to apply that manifest to.

In this blog post I will use the NSX API to clean a stale Tanzu Kubernetes Cluster out of the NSX inventory. The same method can be used to clean up any stale Antrea cluster (it does not have to be a Tanzu cluster).

Note: The method discussed in this blog post has been tested in a lab environment. If you face the same issue in production, engage VMware Global Support to confirm the procedure before attempting it in your production environment.

Lab Inventory

For software versions I used the following:

    • VMware ESXi 7.0U3d
    • vCenter server version 7.0U3g
    • VMware NSX-T
    • TrueNAS 12.0-U7 used to provision NFS data stores to ESXi hosts.
    • VyOS 1.4 used as lab backbone router and DHCP server.
    • Ubuntu 20.04.2 LTS as DNS and internet gateway.
    • Ubuntu 18.04 LTS as Jumpbox and running kubectl to manage Tanzu clusters.
    • Windows Server 2012 R2 Datacenter as management host for UI access.
    • Guest Tanzu Kubernetes Cluster hosted on Workload Management (TKGs).

For virtual hosts and appliances sizing I used the following specs:

    • 3 x ESXi hosts each with 12 vCPUs, 2 x NICs and 128 GB RAM.
    • vCenter server appliance with 2 vCPU and 24 GB RAM.

Ungracefully Deleting a Tanzu Kubernetes Cluster 

To simulate a stale Tanzu Kubernetes Cluster, I already deployed a TKC and integrated it with an NSX-T instance. My current TKC and NSX-T configuration looks like the following:

Checking further cluster details along with the Antrea pods:

Everything looks good from the cluster perspective. Now let's navigate to the NSX Manager UI and verify the cluster status:

Now let's delete our hotdog cluster from the CLI and see if NSX Manager reflects the change:

kubectl delete tanzukubernetesclusters.run.tanzu.vmware.com hotdog

Give it a couple of minutes, then verify both from the CLI and from vCenter (because I am running a guest TKC on top of vSphere with Tanzu) that your Tanzu cluster is gone:

From vCenter I can see the delete events of the hotdog TKC:

Give it a couple more minutes and then check the inventory again from the NSX Manager UI. You will see that NSX Manager still shows the deleted cluster as up and running without any issues:

As expected, NSX Manager cannot proactively detect the status of Antrea nodes, since the integration and communication are handled by the interworking pods, which are now gone.

Removing the Stale Guest TKC clusters using NSX API

At this point there is no option in the NSX UI to remove our guest TKC, so we need to clean it up via the API in order to remove its information from the NSX inventory.

To generate API calls towards my NSX Manager I am going to use cURL from my Ubuntu jumpbox; however, you can use any other method you are comfortable with to generate API calls.
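The `-n` flag used in the cURL calls below tells cURL to read credentials from `~/.netrc`, which keeps the NSX admin password out of the shell history. A minimal sketch of setting that up (the hostname matches my lab's NSX Manager; the login and password here are placeholders, substitute your own):

```shell
# curl -n reads credentials from ~/.netrc; one entry per NSX Manager.
# "admin" / "VMware1!" are placeholder credentials -- replace with your own.
cat > ~/.netrc <<'EOF'
machine nsx-l-01b
login admin
password VMware1!
EOF
chmod 600 ~/.netrc   # curl expects this file to be private
```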

Step 1: Verify & delete guest TKC from NSX Inventory via APIs

I am logged in to my Ubuntu machine and I will run the following command against my NSX Manager:

curl -k -n --request GET https://nsx-l-01b/api/v1/fabric/container-clusters/
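The GET call returns a JSON document listing every container cluster NSX knows about. As a sketch, the cluster's `external_id` can be pulled out of that response with a quick filter instead of copying it by hand (the response string below is abbreviated and illustrative, using the UUID from my lab; in practice you would pipe the live cURL output into the same filter):

```shell
# Abbreviated sample of the /fabric/container-clusters response (illustrative)
response='{"results":[{"external_id":"007cb334-7dde-4c26-a61a-6d7f37f0c59a","display_name":"hotdog","cluster_type":"AntreaKubernetes"}]}'

# Extract the external_id value -- this is the UUID the DELETE call needs
uuid=$(printf '%s' "$response" | sed -n 's/.*"external_id":"\([^"]*\)".*/\1/p')
echo "$uuid"   # -> 007cb334-7dde-4c26-a61a-6d7f37f0c59a
```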

From the above output, the cluster is indeed still present in the NSX inventory. We need to copy the cluster UUID (the external_id field) and delete the cluster from the NSX inventory using the following API call:

curl -k -n --request DELETE https://nsx-l-01b/api/v1/fabric/container-clusters/007cb334-7dde-4c26-a61a-6d7f37f0c59a

The above command returns no output; for this API call, an empty response means it executed successfully. The next step is to delete the cluster control plane endpoints from the NSX database.

Step 2: Delete Cluster Enforcement Endpoints

From my Ubuntu machine I will run the following command against my NSX Manager:

curl -k -n --request GET https://nsx-l-01b/api/v1/infra/sites/default/enforcement-points/default/cluster-control-planes
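As a sketch, the enforcement point path can likewise be filtered out of the GET response rather than copied manually (the response below is abbreviated and illustrative, matching the lab's cluster name):

```shell
# Abbreviated sample of the cluster-control-planes response (illustrative)
response='{"results":[{"id":"hotdog-tanzu-cluster","path":"/infra/sites/default/enforcement-points/default/cluster-control-planes/hotdog-tanzu-cluster"}]}'

# Extract the policy path of the stale cluster control plane
cp_path=$(printf '%s' "$response" | sed -n 's/.*"path":"\([^"]*\)".*/\1/p')
echo "$cp_path"
```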

From the above output, copy the path of the stale cluster control plane enforcement point, as highlighted above, and delete it using the following command:

curl -k -n --request DELETE https://nsx-l-01b/api/v1/infra/sites/default/enforcement-points/default/cluster-control-planes/hotdog-tanzu-cluster?cascade=true

Step 3: Verify Cluster has been successfully removed from NSX inventory 

curl -k -n --request GET https://nsx-l-01b/api/v1/fabric/container-clusters/
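Once the cleanup has gone through, the same GET returns an empty results list. A scripted version of that check (the response shown is the illustrative empty-inventory shape):

```shell
# Sample empty-inventory response (illustrative)
response='{"results":[],"result_count":0}'

# A result_count of 0 means no container clusters remain in the NSX inventory
count=$(printf '%s' "$response" | sed -n 's/.*"result_count":\([0-9]*\).*/\1/p')
[ "$count" -eq 0 ] && echo "inventory clean"
```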

Also, from the NSX UI navigate to Inventory > Inventory Overview and verify that no Tanzu clusters or related information remain.

Hope you have found this blog post useful!