I've got a Fortigate firewall connected over the internet via public IPs to a Bintec router (or whatever the hell it's called) via IPsec. It's for a customer of ours so they can access a server at one of their partner companies.
The tunnel comes up and stays up, but after the Phase 2 timeout (or close to the point where it would time out), data traffic ceases. I've confirmed this by running a steady ping from the Fortigate (using its internal trusted interface as the source) to the destination server. Since a new ping goes out every second, it effectively counts how long data flows over the tunnel.
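For anyone who wants to reproduce the test: the steady ping can be run from the FortiOS CLI roughly like this (the addresses below are placeholders, not from my setup, and exact option names can vary slightly by firmware version):

```
execute ping-options source 192.168.1.1    # internal trusted interface IP as the source
execute ping-options repeat-count 10000    # keep pinging for a long stretch
execute ping 10.0.0.10                     # the destination server behind the far end
```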
I've contacted the techs on the far end, and while they seem to have more experience with IPsec tunnels, they don't have any concrete answers either. We raised the key lifetime from 1,800 seconds to 18,000 seconds. Sure enough, about 16,300 seconds later the tunnel died again, whereas it used to die after about 1,600 seconds.
If I click the "Bring Down" option to kill the tunnel, the tunnel suddenly starts passing traffic again, without ever actually going down.
I have the "Autokey Keep Alive" option activated on the Fortigate, but as far as I could tell the other guys don't have any such option. Dead Peer Detection is also activated in Phase 1.
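For anyone searching later, these options (plus the key lifetime mentioned above) map to CLI settings on the Fortigate roughly as follows. The tunnel names are placeholders, and the table names depend on the VPN type: `phase1`/`phase2` for a policy-based tunnel, `phase1-interface`/`phase2-interface` for a route-based one. Exact keywords also differ a bit between FortiOS versions (older builds use `set dpd enable` instead of the mode keywords):

```
config vpn ipsec phase1-interface
    edit "partner-tunnel"             # placeholder name
        set dpd on-idle               # Dead Peer Detection
    next
end
config vpn ipsec phase2-interface
    edit "partner-tunnel-p2"          # placeholder name
        set keylifeseconds 18000      # was 1800; should match the remote end
        set keepalive enable          # "Autokey Keep Alive"
        set auto-negotiate enable     # renegotiate the SA before it expires
    next
end
```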
TL;DR: Anyone have any ideas why my IPsec tunnel suddenly stops passing traffic yet stays online, and instantly resumes passing traffic as soon as I reinitialize it by hand?
Second of all: since I'm testing this connection with never-ending pings, could it be that the tunnel reactivates on TCP or UDP traffic and "ignores" ICMP? I ask because I've noticed that even though the tunnel goes down regularly, it comes back up on its own, suggesting that maybe an initiated TCP session brought it back up or something...
In general, the devices bring up the IPsec tunnel when "interesting traffic," as defined on the firewall, is observed. So the answer to your question is: it depends. I'm not terribly familiar with the equipment being used (I'm primarily a Cisco guy), but I would expect the tunnel to go down if no traffic were traversing it.
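On Cisco gear, for example, "interesting traffic" is whatever matches the crypto ACL tied to the crypto map; packets matching it trigger (and keep alive) the IPsec SA. A minimal sketch with invented names and documentation addresses:

```
! Traffic matching this ACL counts as "interesting" and brings up the SA
access-list 101 permit ip 192.168.1.0 0.0.0.255 10.0.0.0 0.0.0.255
!
crypto map PARTNER-MAP 10 ipsec-isakmp
 match address 101
 set peer 203.0.113.2
 set transform-set PARTNER-TS
```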
Is the tunnel going down actually causing a problem? Or does the session get built every time the client goes to access the server? Of course their first attempt might fail if the tunnel isn't built and doesn't get built before the application times out, but a retry should work just fine in that instance.
If you want the tunnel established permanently, you need to keep "interesting traffic" flowing across it. The way we do this for one of our clients is by running a routing protocol across the link (EIGRP in our case since it's pretty chatty).
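A minimal sketch of that approach on Cisco IOS (all names and addresses invented): run the link as a GRE tunnel and let the routing protocol's hellos, sent every few seconds, provide the constant traffic:

```
interface Tunnel0
 ip address 172.16.0.1 255.255.255.252
 tunnel source GigabitEthernet0/0
 tunnel destination 203.0.113.2
!
router eigrp 10
 network 172.16.0.0 0.0.0.3
 network 192.168.1.0 0.0.0.255
```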
I just checked it again, and the tunnel is back up and passing traffic. We set it up at around 10 this morning and it stayed up for 4.5 hours, then died for probably about the same amount of time, and now that I'm at home, it's up again. Really weird.
I don't really have access to the internal devices on either side, so I can't tell whether the firewall rebuilds the connection after idling it out once something comes across. I don't even know if traffic originating from the firewall itself counts as "interesting" enough to rebuild the connection, and I can't really test TCP-based connections because the telnet and ssh commands on the Fortigate don't let me choose a source address or interface (though ping does).
Somehow I think the tunnel is timing out, staying down for the next timeout period and then coming back up for the third timeout. Really strange shit.
I looked at the logs this morning and this is what I found:
The repeated aggressive mode message #1s persist every 15 seconds for about 45 minutes until the log file ends. I was asleep, so I obviously don't know if the tunnel was passing data then, but message #11 shows my login this morning, and when I pinged from that point on the tunnel was passing data... So far the tunnel has been up for about an hour according to the timeout, and I first logged in about an hour and ten minutes ago today...
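In case it helps with the next occurrence: the Phase 1/Phase 2 negotiation can be watched live on the Fortigate CLI with the standard IKE debug (the output is very verbose, so disable it when done):

```
diagnose debug application ike -1    # full IKE negotiation debug
diagnose debug enable
# ... reproduce the problem, then:
diagnose debug disable
diagnose debug reset
diagnose vpn tunnel list             # per-tunnel SA timers and traffic counters
```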
Also, I'm not sure if you meant to, but you left a valid IP address in the log entries you posted.
EDIT: Even though the average lifespan of an SA is typically pretty short in practice, I still recommend not using MD5 for the HMAC; use SHA-1 instead, or better yet SHA-256 if it's available. This is especially important if you're doing U.S. federal government work (and even though you're in Germany, that may still be the case).
I haven't bothered talking to the other guys yet, since I want to thoroughly test my side first to see what works and what doesn't. I set the Phase 2 key lifetime all the way down to two minutes (their side is still sitting at 18,000 seconds), and the connection has been up for about 15,000 seconds, with my side timing out every two minutes and renewing the keys (that's what it does after a timeout, right?)
These settings are all compliments of the customer, so I'm not gonna get into which encryption method is better with them.
Other than that, I haven't tried running main mode, but I will do that as the next step if things continue to crap out...
Is PFS enabled on these endpoints? It occurs to me that if one end has it enabled and the other doesn't, the completion of Phase 2 could appear to "hang" if the endpoint not configured for it mishandles the unexpected extra step. Something else to check out. Really, we could go on and on about this kind of little stuff. Have you grabbed a packet capture of the negotiation?
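On the Fortigate side, PFS is a Phase 2 setting, and the negotiation itself can be captured with the built-in sniffer (placeholder tunnel name; use `phase2` instead of `phase2-interface` for a policy-based VPN, and the port 4500 part of the filter only matters if NAT-T is in play):

```
config vpn ipsec phase2-interface
    edit "partner-tunnel-p2"          # placeholder name
        set pfs enable                # must match the remote end
        set dhgrp 14                  # DH group must match as well
    next
end

diagnose sniffer packet any "udp port 500 or udp port 4500" 4
```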
In any event, good luck, and if you don't mind posting here or PMing me with what you do find as a resolution, I would appreciate it. I'm curious.
We're facing the same problem here. We've tried everything, debugging and so on, but no success yet. Please share what you learn if you manage to solve the problem.