This is a difficult issue to pin down. I even have this problem intermittently connecting between my development machine and a VM running on the same machine, but I've yet to discover the trigger. One common cause is a router or VPN inactivity timeout, but in that case it would be pretty regular. (In my VM case, that's clearly not the issue, and the disconnect sometimes occurs within seconds of connecting, other times minutes or hours later.)
To further confound the situation, the server typically only detects the SIGHUP when it tries to write to the connection, so if the job is sitting at the dot prompt, or some kind of static input, it could be a long time before the disconnect is discovered. (Typically it would then fall to the IATIMEOUT, assuming you have one.) Note that INFLD prompts will send a probe every 15 seconds (assuming INFLD_KEEPALIVE is set), so in that case you would detect the situation sooner (and maybe even prevent it, if it's related to some kind of router or VPN timeout).
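To illustrate the general idea of probing an idle link (this is not how INFLD does it internally, just the OS-level analog), here is a minimal Python sketch that turns on TCP keepalive for a socket. The function name `enable_keepalive` and the 15-second defaults are my own assumptions, and the fine-grained options only exist on some platforms, so the code sets only the ones it finds.

```python
import socket

def enable_keepalive(sock, idle=15, interval=15, count=3):
    """Turn on TCP keepalive probes for an existing socket (hypothetical helper).

    Returns a dict of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}
    # These tuning options are platform-specific, so apply only those that
    # this Python build exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied
```

With probes going out on an otherwise idle connection, a dead link tends to surface as an error within a few probe intervals rather than waiting for the next write, and the traffic may also keep a router or VPN inactivity timer from firing.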
Your log example above is fairly typical. Note that the "JCB overwritten" message is a bit of a red herring, since it is merely detecting that the SIGHUP handler changed the session connection device identifier from pts/9:21479 to ?pts/9:91479 as a way of broadcasting that the session has lost its client connection. But because you are presumably running with the default hangup response option, the process is allowed to continue running in the background until it gets another error or needs input, which in your case happens a few seconds later.
The hangup response can get a lot messier when the session is waiting on a client-side GUI operation (especially XTREE) since it will likely spit out some errors relating to failure of the expected responses. And, from the client perspective, it may be that the disconnect occurs but neither side recognizes it, because the server is waiting on the client, and the client is waiting on user input, so neither one notices the break in the TCP link. My recommendation to avoid that is to set up IATIMEOUT on the client side to be a bit less than on the server side (and both of them less than any router/VPN-based timeout).
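As a quick sanity check on that ordering (client less than server, and both less than any router/VPN timeout), something like the following Python sketch could be used. The function name and the idea of expressing everything in seconds are my own assumptions; it doesn't read any actual configuration.

```python
def check_timeout_ordering(client_s, server_s, network_s=None):
    """Return a list of problems with the inactivity timeouts (all in seconds).

    network_s is the router/VPN inactivity timeout, if one is known.
    An empty list means the recommended ordering holds.
    """
    problems = []
    if client_s >= server_s:
        problems.append("client timeout should be less than server timeout")
    if network_s is not None:
        if server_s >= network_s:
            problems.append("server timeout should be less than router/VPN timeout")
        if client_s >= network_s:
            problems.append("client timeout should be less than router/VPN timeout")
    return problems
```

For example, `check_timeout_ordering(540, 600, 900)` returns an empty list, whereas equal client and server values would be flagged.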
But unless your disconnects are clearly related to a fixed inactivity period, it's more likely that it's just some external problem causing the connection to drop, and there may not be any good way for either end to determine why. You could look at the system logs on the server to see if there is a rash of disconnects at the same time, or some other errors that might be related, but I haven't been successful with that approach.
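If you do want to check for a rash of disconnects, here is a rough Python sketch of the grouping step. It assumes you have already pulled the disconnect timestamps out of whatever logs you have (the extraction depends entirely on your log format, so I've left it out), and the name `find_bursts` and the 10-second window are just placeholders.

```python
from datetime import datetime, timedelta

def find_bursts(timestamps, window_s=10, min_size=2):
    """Group disconnect times into bursts.

    Consecutive events no more than window_s seconds apart are chained into
    the same burst; bursts with fewer than min_size events are dropped.
    """
    bursts, current = [], []
    for ts in sorted(timestamps):
        # A gap larger than the window closes the current group.
        if current and ts - current[-1] > timedelta(seconds=window_s):
            if len(current) >= min_size:
                bursts.append(current)
            current = []
        current.append(ts)
    if len(current) >= min_size:
        bursts.append(current)
    return bursts
```

Several sessions dropping within the same few seconds would point toward a network or server-wide event rather than anything specific to one session.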
One site that was bedeviled with the problem a couple of years ago ended up launching a sidecar ping-type process with each session just to confirm that both the ATE and ping connections went down at the same time. (They were eventually able to convince themselves that it wasn't just ATE, and that there were fundamental reliability problems in their cloud connection to their server.) But you've already confirmed that with PuTTY (thank you!), so I'm now running out of suggestions.
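For anyone wanting to try the sidecar idea, here is a minimal Python sketch of what such a monitor might look like. I don't know what that site actually ran, so treat this as an assumption-laden stand-in: it uses a periodic TCP connect rather than a real ICMP ping, and the function names, the 15-second interval, and the log format are all made up for illustration.

```python
import socket
import time
from datetime import datetime

def probe_once(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_sidecar(host, port, interval_s=15.0, max_probes=None, log=print):
    """Probe host:port every interval_s seconds and log a timestamped result.

    Runs until max_probes probes have been made (forever if None) and
    returns the list of results.
    """
    results = []
    while max_probes is None or len(results) < max_probes:
        ok = probe_once(host, port)
        results.append(ok)
        stamp = datetime.now().isoformat(timespec="seconds")
        log(f"{stamp} {host}:{port} {'up' if ok else 'DOWN'}")
        time.sleep(interval_s)
    return results
```

Run alongside a session, the timestamps of any DOWN lines can be compared against the time the session dropped to see whether the two failed together.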