I figure this is as good a place as any to talk about the challenges, and the many changes, involved in getting my network set up in a way that lets me try out the tech. This is going to be a long one, as I want to share my reasoning for the decisions I made. I’ll also share which configurations didn’t work and why I thought that was the case. Your comments and feedback will help me learn going forward.
I am by no means a “Networking Guy”. Yes, I have worked on Cisco, Ruckus Networks, and Extreme Networks routers and switches in various jobs, but I was never the point person responsible; I was filling in where I could to help with specific tasks. So this has been some really new ground for me. Throw in some Software-Defined Networking, and my learning curve went vertical. For a HomeLab, Ubiquiti offers some great products at a really good price point. As stated in a previous post, I am using the following networking products:
- Ubiquiti UniFi Dream Machine Pro: Firewall, Gateway, Network Controller ($379)
- Ubiquiti UniFi Switch 16 XG: Core / Aggregate Switch ($599)
- Ubiquiti UniFi Switch PRO 24 PoE: Access Switch ($699)
- VMware NSX Data Center 3.1: Virtualized Networking and Security Stack (included in VMUG Advantage Membership, $200 / year)
I’ve also picked up the following to set up a “2nd site” and try some of the multi-site features of NSX-T and vSphere:
- Ubiquiti EdgeRouter 4: BGP-capable router ($199)
This is the first time I’ve had managed switches in my home network, and I really like Ubiquiti’s management interface on their UniFi line of products. Additionally, there are no subscription fees for their devices; all OS and software updates are included in the cost of the hardware. This is a huge plus for my use case. Unfortunately, there are some definite downsides to the UniFi line, not the least of which is the lack of BGP support. So, for the time being, I am using static routes for my 2 NSX-T Edge Transport Nodes.
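For anyone curious what that workaround looks like: the static routes just point the NSX overlay prefix at the T0 uplink interfaces so return traffic from the physical network can find its way back. On the UDM Pro this is configured in the GUI, but conceptually it’s equivalent to something like the following Linux-style route (the prefix and next-hop addresses here are made-up documentation addresses, not my actual config):

```shell
# Sketch only -- substitute your own overlay prefix and T0 uplink IPs.
# One overlay prefix, reachable via either Edge Transport Node uplink
# (multipath so either edge can carry the return traffic):
ip route add 198.51.100.0/24 \
    nexthop via 192.0.2.2 \
    nexthop via 192.0.2.6
```

With BGP, the T0 would advertise these prefixes dynamically instead, which is exactly what I’m hoping to get out of the EdgeRouter 4.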
I’ve looked into NVIDIA Cumulus VX as a free VM-based router, but since networking isn’t something I’m comfortable with, the learning curve was too steep for now. Perhaps I’ll go back and try it out later. Most of the other BGP-capable virtual router options I found had a hefty subscription fee attached. What did I miss? Feedback welcome!
Design Considerations / Decisions
For this effort, I’m using the following design considerations:
- Use of vSphere Distributed Switches (VDS) 7.0.0 instead of N-VDS
- Use of static routes from Edge Transport Nodes to/from physical network
- NSX Management Nodes and Edge Transport Nodes hosted on physical servers (not embedded ESXi)
- 2 vSphere Clusters: 1 for Physical Servers, 1 for Embedded ESXi Servers
- MTU of 1700 on all switches, with “Jumbo Frames” enabled on all Ubiquiti switches
- I found the NUC’s NICs struggled with packets larger than 1700 bytes. I will be testing the 10GbE adapter to see if I can increase the MTU to 9000 and reduce fragmentation.
- Use a single T0 router within NSX
- All VM workloads will be connected to a single T1 router
- Initial configuration will focus on a single-site setup
- vSAN will be used on Embedded ESXi hosts
- iSCSI will be used on an as-needed basis on physical hosts
- A FreeNAS VM on my desktop is brought up to support the servers
- Need to find an affordable storage option so I can have a more permanent solution…
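A quick note on why the 1700-byte MTU matters: Geneve wraps each inner frame in outer Ethernet/IP/UDP/Geneve headers, so the underlay MTU has to cover the guest MTU plus that overhead (NSX-T’s documented minimum is 1600). A rough back-of-the-envelope check, using typical base header sizes (Geneve options can add more):

```shell
#!/bin/sh
# Geneve encapsulation overhead, base header sizes (options add more).
INNER_MTU=1500   # guest VM MTU
INNER_ETH=14     # inner Ethernet header carried inside the tunnel
GENEVE=8         # Geneve base header
UDP=8            # outer UDP header
OUTER_IP=20      # outer IPv4 header
OUTER=$((INNER_MTU + INNER_ETH + GENEVE + UDP + OUTER_IP))
echo "outer IP packet: $OUTER bytes"
```

That lands at 1550 bytes before any Geneve options, which is why 1600 is the usual floor; 1700 gives some headroom without asking the NUC’s NICs for full 9000-byte jumbo support.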
This image shows how the HomeLab is wired up:
I have a QNAP QNA-T310G1S Thunderbolt 3 to 10GbE adapter (single-port Thunderbolt 3 to single-port 10GbE SFP+) waiting at the post office for the NUC, but this is my current setup. The Dell has 2 10Gb and 2 1Gb RJ-45 connections. On the NUC, I’m using the built-in 1Gb LAN port and 2 USB 1Gb NICs. This is where the learning began. My original setup was:
- Dell R730
- 2 1Gb PNICs assigned to vSwitch0 for management traffic
- 1 10Gb PNIC assigned to SEO-DSwitch (Distributed Switch) for all production traffic
- 1 10Gb PNIC assigned to SEO-Storage-vDS (Distributed Switch) for iSCSI traffic
- Intel NUC
- 1 1Gb PNIC (on-board) assigned to vSwitch0 for management traffic
- 1 1Gb PNIC (USB) assigned to SEO-DSwitch for production traffic
- 1 1Gb PNIC (USB) assigned to SEO-Storage-vDS for iSCSI traffic
The Distributed Switches contained the following port groups:
- Management: VLAN used by VM workloads on the Management network (NSX Management Nodes, Edge Transport Nodes, vRealize Log Insight, etc.)
- Production: VLAN used by VM workloads (DNS servers, etc.)
- SEO-Mgmt-Trunk-DPG: VLAN Trunk containing VLANs for Management and vMotion for embedded ESXi VMs
- Trunk Port Group: VLAN Trunk containing full range of VLANs for embedded ESXi VMs
- SEO-Edge-Uplink: VLAN Trunk for Edge Transport Nodes. Contains the 2 uplink VLANs (2711 and 2712) and the TEP VLAN (1614)
- 1613-vSAN-DPG: VLAN used for vSAN on embedded ESXi hosts
- 224-iSCSI-DPG: VLAN used for iSCSI on physical hosts
- Storage-Trunk-DPG: VLAN trunk containing both the vSAN and iSCSI VLANs for the embedded host VMs
The Ubiquiti switches were configured with the following Switch Port Profiles:
- Management VLANs: SEO-Management (11), vMotion (12)
- Production VLANs: SEO-Management (11), vMotion (12), Production (42), TOR-SEO-01 (2711), TOR-SEO-02 (2712), SEO-TransportNode-TEP (1614)
- Storage VLANs: iSCSI (224), vSAN (1613)
The idea behind this setup is that I can physically separate (sort of; it’s still the same physical switch) the storage traffic from the production traffic. In NSX, I used the SEO-DSwitch for all NSX traffic so the PNICs could be kept for the existing Port Groups and VMs. The Virtual ESXi cluster was given access to all of the required VLANs (which is why the management VLANs were thrown in with production on the switch port profile).
The Installation Journey…
I decided to set up the Virtual ESXi cluster first with NSX and deploy a couple of test VMs to ensure everything was working OK. The management nodes and edge transport nodes all deployed to the Dell (based on available memory). When deploying, I used the same TEP VLAN for the Transport Nodes and the Edge Transport Nodes. Building the network segments from the Segment level up to the T0 Gateway worked as expected. I was able to deploy a small Ubuntu Server VM to the Virtual ESXi cluster and ping to/from the physical network. Feeling pretty stoked at this point!
Then I deployed the NSX bits to the Dell host so I could take advantage of some native compute resources. That’s when things started to get squirrelly. As long as I didn’t attach a VM workload on the Dell to an NSX Segment, everything continued to work well. As soon as I attached a VM to an NSX Segment, I started getting odd results. For instance, in some cases, the port for the VM would default to a “Blocked” state, as shown here:
I was able to manually reset the port and get it up and running, but it involved logging into the host and using some CLI commands. That seemed like a bit of a pain that I didn’t want to deal with, so I ended up doing a full wipe and reload of NSX to get a clean start. There had been a couple of tries up to this point where hosts may or may not have been fully cleared off. After undoing everything and shutting down all of the VMs, I rebooted the Dell (given the capacity of the NUC, its main purpose during these events is to keep 1 of my DNS servers running so vCenter doesn’t freak out when starting up). The reinstall of NSX went well. Again, I deployed it only to the Virtual ESXi cluster and verified full end-to-end connectivity with the test VMs. The VLANs for everything stayed the same as last time: a single TEP IP Pool and TEP VLAN for both the Transport Nodes and the Edge Transport Nodes.
This time, when I added the Dell back into the mix, a new problem appeared: whenever a VM workload was added to an NSX Segment on the Dell, the tunnels to the Edge Transport Nodes would all drop. I logged into each of the ESXi hosts (physical and virtual) and tested pinging each of them from each other using their vxlan netstack interfaces:
```
vmkping ++netstack=vxlan <esxi-tn-vmkernel-ip> -d -s 1672
```
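The `-s 1672` isn’t magic, either: `-s` sets the ICMP payload size, and with `-d` (don’t fragment) the goal is the largest packet that still fits a 1700-byte MTU once the IP and ICMP headers are added back on:

```shell
#!/bin/sh
# Why -s 1672: largest unfragmented ICMP payload for a 1700-byte MTU.
MTU=1700
IP_HDR=20    # IPv4 header
ICMP_HDR=8   # ICMP echo header
PAYLOAD=$((MTU - IP_HDR - ICMP_HDR))
echo "use: vmkping ... -d -s $PAYLOAD"
```

If that size pings but a larger one doesn’t, the underlay MTU is exactly where you configured it.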
Everything worked as expected, but the tunnels still showed as down between the Edge Transport Nodes and everything else. I hit the Googles again (oh, those Googles…so helpful, yet so much chaff to sort through) and found a reference stating that the Edge Transport Nodes need to be on a different TEP VLAN than the host Transport Nodes. So I created a new VLAN, SEO-Edge-TEP (1615), and added it to the SEO-Edge-Uplink trunked distributed port group. This didn’t seem to solve anything. In fact, even though the two VLANs (1614 and 1615) were routable through the Ubiquiti environment, none of the host TEPs could communicate with the Edge Transport Nodes. All tunnels went down whenever any workload was started. Back to the Googles…
I found this gem, which explained that the VDS the Edge Transport Nodes use for their Geneve traffic needs to be different from the VDS the hosts use for theirs. Hmmm…that means I have to free up a PNIC to create a new VDS. Well, who needs a separate Storage VDS for the physical hosts anyway? I don’t have a storage solution yet, so it’s not a big loss. The Virtual ESXi Cluster was still able to keep its access to the Storage VDS, since its vmnics are on trunks anyway. That freed up a PNIC on both the NUC and the Dell. I created storage port groups for iSCSI and vSAN on the SEO-DSwitch so I could migrate the vmkernels from the NUC and the Dell without losing the ability to connect to my FreeNAS VM.
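After migrating the vmkernels, I found it reassuring to double-check the result from each host’s ESXi shell. These are standard esxcli commands (shown as a reference fragment; they only run on an ESXi host):

```shell
# List each distributed switch along with its assigned vmnic uplinks
esxcli network vswitch dvs vmware list

# List vmkernel interfaces -- confirm the iSCSI/vSAN vmks landed on the
# intended port groups and kept the expected MTU
esxcli network ip interface list
```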
After setting this up and migrating the Edge Transport Nodes to a new SEO-Edge-Uplink DPG on the new SEO-Edge-DSwitch, it came time to test the new setup. Again, my initial tests using the Virtual ESXi cluster went well. I was able to bring up a VM workload on two different hosts within the cluster and ping to/from the physical network. Then I brought up another small test VM workload on the Dell and *BANG*, things got squirrelly again! This time, I was able to ping to/from the VM workload on the Dell from the physical network. The tunnels between the Edge Transport Nodes and the Dell stayed up, but the tunnels between the Edge Transport Nodes and the Virtual ESXi cluster hosts dropped. In the image below, 33.3 and 33.10 are Ubuntu VMs on the Virtual ESXi cluster; 33.5 is a RHEL VM on the Dell.
Talk about FRUSTRATING!!! Back to the Googles… After a lot of searching, I found another great blog entry that said a workaround for this problem is to set up separate VDSs for the physical and virtual hosts. This got me scratching my head. Uggh…ANOTHER VDS??? OK, I’m REALLY happy I have 4 PNICs on the Dell, but as of today I only have 3 on the NUC. I created SEO-Physical-DSwitch as my new VDS for the physical hosts and, once again, swapped the PNIC assignments to the VDSs. Until I get the 10GbE adapter for the NUC, I am going to have to keep the Edge Transport Nodes on the Dell only. Here is my final setup for the PNIC assignments:
After adjusting every aspect of the network (I missed a couple of Ubiquiti Switch Port Profile assignments for the NUC the first time), I was able to bring up workload VMs on the Virtual ESXi Cluster, the Dell, and the NUC on NSX Segments and ping to/from the physical network. Here is my list of VDSs and DPGs / NSX Segments. There are a few extras, since I’m now toying with the vSphere with Kubernetes features of vSphere 7.0.1.
As I stated at the beginning, I’m not a networking guy. And although I grew up in the Linux CLI, I’m not comfortable working the CLI on the NSX or ESXi hosts to pull packet captures and such. One day, perhaps. Long story short, this is something that definitely takes some planning to set up in a home environment. Ironically, meticulous planning and HomeLab environments don’t always coincide. We like to play, after all! I’ve linked the various resources that helped throughout this post and provided a list of references below. I would love to hear your comments or feedback on my setup.