Endarkenment: 2016

Was working with a customer's outsourcing provider to establish a set of meshed VPNs. Both the outsourcer and the customer have complex network environments, requiring double-NAT'ing on either side of the VPN, so when things didn't work, we settled down for some troubleshooting.

First the VPN wasn't coming up at all... bread-and-butter stuff, usually routing between the endpoints (assuming your auth and communities are right). This one was a little more interesting: it turned out the outsourcer had two firewall clusters on the same external segment, and when they looked at the logs, the inbound IKE traffic was hitting the wrong cluster. Simple enough, routing on their border router, right? Wrong, as it turned out: the routing table was fine. The necessary clue came when network support at the outsourcer said "the VRRP address of this cluster is...". It turned out that the clusters shared the same VRID (Virtual Router ID; VRRP is RFC 3768, now obsoleted by Version 3 for IPv4 and IPv6 in RFC 5798), which meant that they shared the same virtual router MAC address, since that MAC is derived from the VRID. That is how traffic that was routed correctly was ending up on the wrong firewall. The outsourcer changed one of their VRIDs, and we were good to go!
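For anyone wondering why a shared VRID bites like this, here is a rough sketch of the arithmetic. The VRRP RFCs define the virtual router MAC as 00-00-5E-00-01-{VRID} for IPv4 (and 00-00-5E-00-02-{VRID} for IPv6), so two clusters on the same segment with the same VRID answer for the same MAC. The cluster names and VRID values below are made up for illustration, and the helper function names are my own, not from any vendor tool.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Return the VRRP virtual router MAC address for a given VRID."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1-255")
    # Per the VRRP RFCs: 00-00-5E-00-01-{VRID} for IPv4,
    # 00-00-5E-00-02-{VRID} for IPv6.
    family_octet = 0x02 if ipv6 else 0x01
    return "00:00:5e:00:{:02x}:{:02x}".format(family_octet, vrid)


def find_vrid_collisions(clusters: dict) -> dict:
    """Group cluster names by virtual MAC and keep only the shared ones."""
    by_mac = {}
    for name, vrid in clusters.items():
        by_mac.setdefault(vrrp_virtual_mac(vrid), []).append(name)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical clusters on one external segment (names and VRIDs invented).
    clusters = {"cluster-a": 10, "cluster-b": 10, "cluster-c": 11}
    print(find_vrid_collisions(clusters))
    # prints {'00:00:5e:00:01:0a': ['cluster-a', 'cluster-b']}
```

Any two clusters that show up under the same MAC in that output will fight over frames on a shared segment, no matter how clean the routing table looks.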

Well, not quite... the VPN now came up and the firewalls at both ends logged the customer's traffic, but the customer couldn't connect to the web app. Turned out the SYN was going out, but no SYN-ACK was coming back! The outsourcer checked their side and the SYN-ACK was getting back to the firewall. We had them do an "fw monitor -p all" and the packet was reaching the encrypt chain. However, we never saw any traffic at the customer end. Further inspection showed that there were never any Phase 2 renegotiations from the outsourcer's end either. When we did the tcpdumps at our end, it was

...transmission interrupted...


Monday, December 05, 2016

Legacy post rescued from Drafts: Troubleshooting Check Point VPNs