NSX Application Platform (NAPP) was introduced by VMware with the release of NSX 3.2 as the underlying platform for running NSX features such as NSX Intelligence, Network Detection and Response (NDR) and Malware detection. NAPP components and these features no longer run as OVAs but in containers, which matches the broader shift in VMware's solution offerings towards partial or full use of containers.
This means that to deploy NSX Application Platform you need an up and running Tanzu (TKG or TCE) or upstream Kubernetes cluster to host NAPP and its components. NAPP is offered in three form factors: Evaluation, Standard and Advanced. The difference between them is the set of applications that NAPP will deploy for you. In this blog post I deployed the Standard form factor (more about form factors HERE).
I have to be honest here: NAPP deployment is not a piece of cake. It took me 5 days to figure out some deployment concepts and to overcome a couple of issues and limitations before I got the platform up and running. In this blog post I will focus on deployment requirements, steps and gotchas, in addition to a quick dive into the deployed NAPP pods to understand how NAPP runs under the hood.
For software versions I used the following:
- VMware ESXi 7.0U3f
- vCenter server version 7.0U3f
- TrueNAS 12.0-U7 used to provision NFS data stores to ESXi hosts.
- VyOS 1.4 used as lab backbone router and DHCP server.
- Ubuntu 20.04 LTS as bootstrap machine.
- Ubuntu 20.04.2 LTS as DNS and internet gateway.
- Windows Server 2012 R2 Datacenter as management host for UI access.
- Tanzu Kubernetes Grid on vSphere (TKGs) version 1.19.11
- NSX-T 184.108.40.206 as vSphere with Tanzu networking overlay.
- NSX 220.127.116.11 (medium deployment) for NAPP.
For virtual host and appliance sizing I used the following specs:
- 3 x virtualised ESXi hosts each with 8 vCPUs, 4 x NICs and 96 GB RAM.
- vCenter server appliance with 2 vCPU and 24 GB RAM.
We need to ensure that we have the following components installed and running before we proceed:
- Tanzu workload cluster (TKG or TCE) or an upstream Kubernetes cluster, running one of the supported versions listed HERE
- In my lab I used TKGs version 1.19.11. To learn how to deploy a TKGs cluster you can reference my previous blog post HERE
- A load balancer service deployed in your Tanzu or K8s cluster. Since I use TKGs with NSX networking, I did not need to explicitly deploy a load balancer in my Tanzu cluster (NSX takes care of this). If your Tanzu or K8s cluster does not run on NSX then you need to deploy a load balancer to provide the service interface that NAPP will use later on. For this I would recommend either NSX ALB (Avi) or, if you are looking for an open-source option, MetalLB is a very good choice (https://github.com/metallb/metallb/releases)
- DNS up and running.
- Internet connectivity for NSX and Tanzu/Kubernetes nodes to download helm charts needed by NAPP deployment.
Preparing our Tanzu TKGs cluster for NAPP deployment
As mentioned under the overview section, I am deploying NAPP in the Standard form factor, so I needed to deploy my TKGs workload cluster with a control plane node of class medium and worker nodes of class large. Below is the TKC creation YAML file I used to create my TKGs cluster:
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: napp-tanzu-cluster
  namespace: vexpert-ns
spec:
  topology:
    controlPlane:
      count: 1
      class: best-effort-medium
      storageClass: napp-storage-policy
    workers:
      count: 3
      class: best-effort-large
      storageClass: napp-storage-policy
  distribution:
    version: v1.19.11
  settings:
    network:
      cni:
        name: calico #Use Antrea or calico CNI
      pods:
        cidrBlocks:
        - 18.104.22.168/16 #Must not overlap with SVC
      services:
        cidrBlocks:
        - 22.214.171.124/12 #Must not overlap with SVC
Some remarks on the above YAML:
- The namespace and storage classes are already defined in my environment (reference previous post HERE)
- I used Calico as the CNI instead of the default Antrea due to some issues in my lab with Antrea, but using Calico is not a must.
- If you do not specify the Pod and Services CIDRs in your TKC YAML, the cluster defaults to subnet 192.168.0.0/16 for the Pod CIDR. This caused an issue in my lab because it is the same subnet I use for my infra management network, so I recommend manually defining the network section in your TKC YAML as I did.
- The version must match one of the versions supported by NAPP (see above).
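The overlap problem described in the remarks above can be checked before applying the YAML. The following is only a minimal sketch and was not part of my original deployment: the helper names (ip_to_int, cidr_range, cidrs_overlap) are hypothetical, it handles IPv4 only, and the 192.168.10.0/24 management subnet in the usage note is an assumed example value.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "<first> <last>" (as integers) for an IPv4 CIDR such as 192.168.0.0/16.
cidr_range() {
  local ip=${1%/*} prefix=${1#*/}
  local base mask first last
  base=$(ip_to_int "$ip")
  mask=$(( prefix == 0 ? 0 : (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  first=$(( base & mask ))
  last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$first $last"
}

# Return success (exit code 0) when the two CIDR ranges share any address.
cidrs_overlap() {
  local s1 e1 s2 e2
  read -r s1 e1 <<< "$(cidr_range "$1")"
  read -r s2 e2 <<< "$(cidr_range "$2")"
  (( s1 <= e2 && s2 <= e1 ))
}
```

For example, `cidrs_overlap 192.168.0.0/16 192.168.10.0/24 && echo "overlap"` would flag the default Pod CIDR against a management network in that range, which is the situation I ran into.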
Once your YAML is ready, apply it as follows to start creating your TKGs cluster (details of doing so can be found HERE)
kubectl apply -f <deployment_filename.yaml>
vSphere will start deploying the control plane and worker node VMs and build the TKG cluster for you. Give it some time and then check the available Tanzu Kubernetes clusters; you should see our napp-tanzu-cluster created.
kubectl get tanzukubernetesclusters.run.tanzu.vmware.com
Now, log in to our TKGs cluster (napp-tanzu-cluster) and verify that all nodes are Ready:
kubectl vsphere login --server=<supervisor cluster ip> --tanzu-kubernetes-cluster-name napp-tanzu-cluster --tanzu-kubernetes-cluster-namespace vexpert-ns -u email@example.com --insecure-skip-tls-verify
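Once logged in, node readiness can be checked with kubectl get nodes. As a small sketch (the function name not_ready_nodes is hypothetical), the helper below prints any node whose STATUS column is not exactly Ready, so empty output means all nodes are Ready.

```shell
#!/usr/bin/env bash
# Print the name and status of every node whose STATUS is not exactly "Ready".
# Assumes kubectl is already pointed at the TKGs cluster (see the login above).
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
}
```

Calling `not_ready_nodes` should print nothing once the cluster is healthy. If you prefer to block until the nodes come up, `kubectl wait --for=condition=Ready nodes --all --timeout=600s` is an alternative (the timeout value here is an arbitrary choice).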
Now that our Tanzu cluster is ready, the last step is to generate a kubeconfig file as described HERE. This kubeconfig file will be used by the NAPP deployment process to connect to our TKGs cluster and start deploying NAPP pods.
Preparing our Infra (DNS) for NAPP deployment
Before we jump to the NAPP deployment, we need to ensure that we have 2 available IPs in the IP pool configured on our load balancer. As mentioned at the beginning of this blog post, NAPP requires a load balancer service in our Tanzu/Kubernetes cluster so that it can provision a service for external traffic accessing the platform pods (this is called the service interface, through which the NAPP feature pods are accessed from the external network).
In addition to the service interface, NAPP requires another free IP address from the same pool. If you are using NSX ALB, pick any two free IPs from the IP pool assigned to your VIPs. If you are using MetalLB, pick any two free IPs from the configured IP block. Finally, if you are using NSX-T as in my case, pick 2 free IPs from the external service pool that NSX created when workload management was enabled in vSphere (reference my previous blog post HERE)
What I did was log in to my other NSX instance (126.96.36.199 as mentioned under lab inventory), which provides networking for my supervisor cluster and TKGs, navigate to Networking > IP Address Pools and highlight the pool used for assigning IPs to services (ingress CIDRs)
In my case, it is 188.8.131.52/24. You can also get this information from vCenter: navigate to your cluster which has Workload Management enabled, then go to Configure > Supervisor Cluster > Network and expand Workload Network
In my case, I picked 184.108.40.206 for the service interface and 220.127.116.11 for the messaging service interface and added them as A records to my DNS server:
18.104.22.168 will resolve to napp-svc.corp.local and 22.214.171.124 to napp-msg-svc.corp.local.
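Before starting the deployment it is worth confirming that both records actually resolve from the machine you are working on. The following is a minimal sketch, assuming a Linux host with glibc's getent available (such as the Ubuntu bootstrap machine); the function name resolves_to is hypothetical and the IP arguments are placeholders for the addresses you picked.

```shell
#!/usr/bin/env bash
# Succeed only if the FQDN resolves to the expected IPv4 address.
# Usage: resolves_to <fqdn> <expected-ip>
resolves_to() {
  local fqdn=$1 expected=$2 actual
  # getent goes through the system resolver; ahostsv4 limits the answer to IPv4.
  actual=$(getent ahostsv4 "$fqdn" | awk '{print $1; exit}')
  [ "$actual" = "$expected" ]
}
```

Usage would look like `resolves_to napp-svc.corp.local <service-interface-ip> && echo ok`, repeated for napp-msg-svc.corp.local with the messaging service IP.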
NSX Application Platform Deployment
The last step is to navigate to our NSX instance (I used NSX 126.96.36.199, but you can use 3.2.1 onwards; I would not recommend 3.2).
Under System > NSX Application Platform, click on Deploy NSX Application Platform.
In the below step, if your deployment is air-gapped and does not have internet connectivity, you need to add private Helm and Harbor repositories and upload the NAPP images to them. The default URLs shown below require internet connectivity.
Next, you need to upload the kubeconfig file that we created earlier in this post and fill in the FQDN entries for both the service interface (napp-svc.corp.local) and the messaging service (napp-msg-svc.corp.local) as explained earlier.
Click on Next and run all the pre-checks and ensure that ALL pre-checks succeed.
Confirm the deployment configuration and click on Deploy to start the deployment process.
It took me almost 2 hours to finalise the NAPP deployment, as it involves quite a lot of components.
Important note: I noticed that the deployment might throw failures during the platform deployment (at 40%) and the metrics server deployment (at 80%). You should be able to click on the Retry button and get past those issues. After some investigation, it appears that creating the NAPP platform pods takes a long time and involves many inter-pod dependencies (pods waiting for other pods to complete). This takes longer than the UI wizard timeout, so the UI reports deployment failures while in fact the pods are simply taking quite some time to be created on the underlying Tanzu/Kubernetes cluster.
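To tell a real failure from a slow rollout, you can look at the pods directly while the wizard is running. This is a small sketch rather than an official procedure: the function name unsettled_napp_pods is hypothetical, and it assumes the platform pods live in the nsxi-platform namespace described in the verification section of this post.

```shell
#!/usr/bin/env bash
# List pods in the NAPP namespace that are neither Running nor Completed,
# printing each pod name followed by its current STATUS.
unsettled_napp_pods() {
  kubectl get pods -n nsxi-platform --no-headers |
    awk '$3 != "Running" && $3 != "Completed" {print $1, $3}'
}
```

Running `unsettled_napp_pods` before clicking Retry gives a quick view of what is still being created; `kubectl get pods -n nsxi-platform -w` streams the same information continuously.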
Once the deployment is done, you should see the below screen and the ability to deploy any feature you like corresponding to the form factor you chose.
NAPP deployment verification
In order to verify or troubleshoot NAPP deployment, you need to familiarise yourself with the components which are deployed by the NAPP.
NAPP creates the below highlighted namespaces on your Tanzu Cluster:
cert-manager is where the cert-manager pods are created (they manage the creation and handling of NAPP pod certificates), while projectcontour is for the Contour deployment, an open-source ingress project.
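A quick way to confirm that the namespaces discussed in this section exist is to compare them against the cluster. This is a sketch under the assumption that the three namespaces named in this post (cert-manager, projectcontour and nsxi-platform) are the ones to look for; the function name missing_napp_namespaces is hypothetical.

```shell
#!/usr/bin/env bash
# Print each expected NAPP namespace that is not present in the cluster.
# "kubectl get namespaces -o name" returns lines such as "namespace/default".
missing_napp_namespaces() {
  local existing ns
  existing=$(kubectl get namespaces -o name)
  for ns in cert-manager projectcontour nsxi-platform; do
    grep -qx "namespace/$ns" <<< "$existing" || echo "$ns"
  done
}
```

Empty output from `missing_napp_namespaces` means all three namespaces were created.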
The NAPP platform itself is deployed in the nsxi-platform namespace. For a finalised and successful deployment, the output of kubectl get pods -n nsxi-platform should look like the below:
Deploying NAPP is not as straightforward as it looks above. You will come across some challenges depending on the environment you are deploying NAPP in. However, it is a great learning opportunity, and the features NAPP offers are awesome and a must-have (in my opinion).
Hi sir, I have deployed TKGs with NSX-T, but every time I try to create a pod it gets stuck on insufficient resources with a failed scheduling event, even though all nodes are Ready. I deployed it in my home lab.
Probably your nodes do not satisfy the sizing requirements for the NAPP form factor you are trying to deploy. Please review the NAPP release notes and system requirements.
It is strange, because my physical machine has 128 GB of memory and runs VMware Workstation to manage the nested VMs. I installed each host with 64 GB of memory and 10 vCPUs, and the VyOS router is connected fine with BGP routes to the T0-GW. I have 3 hosts running as compute (worker) nodes and 3 VMs (supervisor) as master nodes, deployed automatically with the Workload Management GUI from vCenter. Everything seems fine and shows green, and when I check the pods in kube-system etc. they are running well, deployed automatically from the first time. The only thing is that every time I create a pod it gets stuck on that error. I am stuck here with no clue :(