So, there I was…
How many bad stories have started out with those very words? After my last setup article, I was seeing some very quirky behavior in my network infrastructure. For example, deploying Workload Management on my virtual vSphere cluster would never complete successfully. I could ping everything fine, and all of the tunnels in NSX-T showed up as shown here:
Strange, huh? I ran traceroutes in NSX-T and used ping to verify that traffic was making it in and out of the environment to my physical network. Here is a simple diagram of the state of my TEP network connections between the physical, virtual, and edge transport nodes:
Per the many setup guides I found online, I used separate VLANs for the Edge Transport Nodes and the host Transport Nodes. At first, I had all of the nodes connected to the same vSphere Distributed Switch (vDS), SEO-DSwitch, with that VLAN separation in place. This worked as long as only the virtual hosts and the Edge Transport Nodes were up. As soon as I threw in the physical hosts, I had to keep searching. Eventually, I ended up with the configuration shown above, where the physical nodes, virtual nodes, and edge nodes each sit on a separate vDS. The VLAN separation stayed throughout. As a result, all of the tunnels came up and it looked like I had it beat. But I didn’t…
It turns out that as long as my workload traffic never left the single T1 gateway within NSX, most things worked as advertised. It was when the T0 gateway had to pass traffic that problems started appearing. After a lot of failed Google searches, I turned to Reddit to see if the community could help:
The comments had me take another look at the MTU across the board. All of the MTU settings for the vDS were set to at least 1700 (I can’t go much higher because of the USB NICs on the NUC):
I also made sure Jumbo Frames were enabled on both of the Ubiquiti switches. I found I could run the following command from desx1 and get a successful response from the TEP addresses on the NUC and the virtual ESXi hosts:
vmkping ++netstack=vxlan 172.27.14.<x> -d -s 1672
However, whenever I tried the same thing against the Edge Transport Node TEP addresses, it would fail. In fact, I had to reduce the payload size to 1472 to get a successful response. I couldn’t find anywhere this was being blocked. So I went for a walk to clear my head and avoid throwing my nice computer out the window.
You know, maybe I should go for walks more often. Because as I was walking, I realized something: Jumbo Frames are a Layer 2 feature. The route between the two TEP VLANs is a Layer 3 hop, and that was probably what was limiting everything to 1500 bytes.
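The ping sizes line up with that theory. Here is the back-of-the-envelope math (the 8-byte ICMP header and 20-byte IPv4 header sizes are the standard values; vmkping’s -s flag sets only the ICMP payload):

```shell
# vmkping -s sets the ICMP payload size; the on-wire IP packet also
# carries an 8-byte ICMP header and a 20-byte IPv4 header.
icmp_header=8
ipv4_header=20

# Largest payload that worked across the routed hop to the edge TEPs:
echo $(( 1472 + icmp_header + ipv4_header ))   # 1500 -> a standard-MTU Layer 3 hop

# Payload that worked host-to-host within the same TEP VLAN:
echo $(( 1672 + icmp_header + ipv4_header ))   # 1700 -> the jumbo MTU on the vDS
```

In other words, pings inside one VLAN were riding the 1700-byte jumbo MTU, while anything crossing the router between the TEP VLANs was being squeezed down to a plain 1500 bytes.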
So what about the need for the Edge Transport Nodes to be on a different VLAN? Well, some routers have the ability to raise the MTU on routed links. Unfortunately, I haven’t found that feature in version 6.0.43 of the Ubiquiti Network software. I thought back to the justification folks were using for moving the edge nodes elsewhere, and to why my physical host was having issues. It turns out that with a separate vDS for each of the 3 transport node groups (each with its own vmnics assigned), we can use the same VLAN without any issues! Do we have a possible fix??? I made the following changes to my configuration:
After making all of those changes, I waited for NSX-T to report that the tunnels were back online. Once I had green lights across the board, I went straight to testing. First off was to test a large ping from desx1 to the edge TEP interfaces:
Next was to test the ability of a VM sitting on an NSX segment to ping out appropriately and do a DNS query. This test had failed previously:
With everything looking good across the board, I went to redeploy Workload Management in vCenter, and sure enough: IT WORKS!!!! Here is the new and improved TEP architecture in my network after these changes:
So it turns out the issue was jumbo frames not making it across the routed hop in my physical infrastructure intact. As the VMware documentation notes, this usually isn’t an issue, but it is for the NSX-T GENEVE tunnels. As I stated earlier, there are vendors who support routing jumbo frames, and I have found a few community posts stating Ubiquiti supports this in a limited way in order to accommodate some PPPoE connections.
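For context, here is roughly why the TEP MTU has to be bigger than 1500 in the first place. GENEVE wraps each original frame in a new outer Ethernet/IP/UDP/GENEVE stack; the numbers below use the base 8-byte GENEVE header with no options (NSX-T can tack on options, which is why the documentation calls for an MTU of at least 1600, with 1700 or more as the safer choice):

```shell
# A full-size inner packet, once GENEVE-encapsulated, grows to:
inner_ip=1500   # original IP packet from the VM
inner_eth=14    # inner Ethernet header carried inside the tunnel
geneve=8        # base GENEVE header (variable-length options add more)
outer_udp=8     # outer UDP header
outer_ip=20     # outer IPv4 header

echo $(( inner_ip + inner_eth + geneve + outer_udp + outer_ip ))   # 1550
```

So even with zero GENEVE options, a 1500-byte guest packet becomes a 1550-byte outer IP packet, and every link and routed hop on the TEP path has to carry it without fragmenting.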
I learned A TON throughout this process. Google didn’t help me out as much as I would have hoped, PROBABLY because most of the folks setting up NSX-T have a solid understanding of networking and would have seen the Layer 2/Layer 3 conflict right away. For the rest of us, I leave you with this helpful blog article, and the frustration of weeks of my life finally floating away. With my infrastructure settled, I can now really start working with the cool toys here.
Like and share this post if you find it helpful, informative, or know someone who has been struggling!