SSH Disconnects #34670 26 Oct 21 02:46 PM
Joined: Aug 2016
Posts: 329
J
John Andreasen Offline OP
Member
Hi,

We have a couple of different customers where their ATE sessions are intermittently dropped. They receive the standard "SSH Rcv Status -20 (-43) SSH Channel unexpected recv error" message. One is hosted in the cloud, and the other has an on-site server. I have been working with the one in the cloud. In one instance, it appeared that there were no log entries in the server side ashlog to coincide with the disconnect, but in two other instances of getting disconnected, there were entries. I also connected to the server with ATE and PuTTY from the same machine. The PuTTY session was disconnected when ATE was. That being the case, it would seem to indicate that the problem is not necessarily ATE or A-Shell related.

Here is a sample of the server side ashlog from one of the instances that were logged:
Code
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:0xc438> SIGHUP trapped on: TSKAAJ (johna)
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:0xc438>  (Waiting for kbd wait or tcki to generate error; tinstate=-10000, rlock=0)
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:0xc4ec> Job 10 condev updated from ?ts/9:21479 to tsk:21479
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:(nil)> Job 10 condev updated from ?ts/9:21479 to tsk:21479
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:(nil)> jcbrebuild #4
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:(nil)> 20-Oct-21 11:12:03 [TSKAAJ,10,SMENU,johna,tsk:21479,totjobs=46] License recount(4)
  Was: 12P/12L, Is: 12P/12L (free passes: 0 tty, 0 ip, 0 ch), GrpUsg:0,0,0

20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:(nil)>   Trigger: JCB record overwritten by 
20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:(nil)>     device:?ts/9:21479, user:johna, job #10

20-Oct-21 11:12:08 [tsk:21479-10]<SMENU:(nil)>  Job in kbd wait; cleaning up qflock & exiting.
20-Oct-21 11:12:08 [tsk:21479-10]<SMENU:(nil)> Out: Nodes Remaining = 11P/11L, 9215 reads, 0 writes, 575 kbd bytes
20-Oct-21 11:12:08 [tsk:21479-10]<SMENU:(nil)>  After qpurge & qclose

Is there any particular ATE trace we could enable to gain more insight into why the connection is being dropped?

Any thoughts/suggestions on how to troubleshoot would be appreciated.

Thanks!
John

Re: SSH Disconnects [Re: John Andreasen] #34673 26 Oct 21 05:22 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
This is a difficult issue to pin down. I even have this problem intermittently connecting between my development machine and a VM running on the same machine, but I've yet to discover the trigger. One common cause is a router or VPN inactivity timeout, but in that case it would be pretty regular. (In my VM case, that's clearly not the issue, and the disconnect sometimes occurs within seconds of connecting, other times minutes or hours later.)

To further confound the situation, the server typically only detects the SIGHUP when it tries to write to the connection, so if the job is sitting at the dot prompt, or some kind of static input, it could be a long time before the disconnect is discovered. (Typically it would then fall to the IATIMEOUT, assuming you have one.) Note that INFLD prompts will send a probe every 15 seconds (assuming INFLD_KEEPALIVE is set), so in that case you would detect the situation sooner (and maybe even prevent it, if it's related to some kind of router or VPN timeout).

Your log example above is fairly typical. Note that the "JCB record overwritten" message is a bit of a red herring, since it is merely detecting that the SIGHUP handler changed the session connection device identifier from pts/9:21479 to ?ts/9:21479 (as it appears in your log) as a way of broadcasting that the session has lost its client connection. But because you are presumably running with the default hangup response option, the process is allowed to continue running in the background until it gets another error or needs input, which in your case happens a few seconds later.

The hangup response can get a lot messier when the session is waiting on a client-side GUI operation (especially XTREE) since it will likely spit out some errors relating to failure of the expected responses. And, from the client perspective, it may be that the disconnect occurs but neither side recognizes it, because the server is waiting on the client, and the client is waiting on user input, so neither one notices the break in the TCP link. My recommendation to avoid that is to set up IATIMEOUT on the client side to be a bit less than on the server side (and both of them less than any router/VPN-based timeout).

But unless your disconnects are clearly related to a fixed inactivity period, it's more likely that it's just some external problem causing the connection to drop, and there may not be any good way for either end to determine why. You could look at the system logs on the server to see if there is a rash of disconnects at the same time, or some other errors that might be related, but I haven't been successful with that approach.
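
For anyone who wants to try the log approach anyway, here is a minimal sketch that counts "SIGHUP trapped" entries per minute in an ashlog, so that several sessions dropping in the same minute stand out. The line format is taken from the sample posted above. The function names and the threshold of 2 are only illustrative assumptions, not anything built into A-Shell.

```python
import re
from collections import Counter

# Matches ashlog lines such as:
# 20-Oct-21 11:12:03 [tsk:21479-10]<SMENU:0xc438> SIGHUP trapped on: TSKAAJ (johna)
SIGHUP_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{2}) (\d{2}:\d{2}):\d{2} .*SIGHUP trapped on: (\S+)"
)


def sighups_per_minute(lines):
    """Count 'SIGHUP trapped' entries per (date, HH:MM) bucket."""
    counts = Counter()
    for line in lines:
        m = SIGHUP_RE.match(line)
        if m:
            date, minute, _job = m.groups()
            counts[(date, minute)] += 1
    return counts


def bursts(counts, threshold=2):
    """Keep only the minutes in which at least `threshold` sessions hung up."""
    return {key: n for key, n in counts.items() if n >= threshold}
```

Feeding it the open log file (for example `bursts(sighups_per_minute(open(path)))`, with `path` pointing at your ashlog) returns a dict keyed by date and minute. An empty result suggests the drops are isolated rather than system-wide.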

One site that was bedeviled with the problem a couple of years ago ended up launching a sidecar ping-type process with each session just to confirm that both the ATE and ping connections went down at the same time. (They were eventually able to convince themselves that it wasn't just ATE, and that there were fundamental reliability problems in their cloud connection to their server.) But you've already confirmed that with PuTTY (thank you!), so I'm now running out of suggestions.
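
A sidecar along those lines could be sketched roughly as follows. This is not the script that site used (I don't have it); it is a simplified stand-in that repeatedly opens a fresh TCP connection to the server and logs each change between reachable and unreachable, using the same timestamp style as the ashlog so the two can be compared. The host, port, and 15-second interval are assumptions to adjust.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def monitor(host, port, interval=15.0, rounds=None, log=print):
    """Probe repeatedly; log each change between reachable and unreachable."""
    last = None
    done = 0
    while rounds is None or done < rounds:
        ok = probe(host, port)
        if ok != last:
            state = "reachable" if ok else "UNREACHABLE"
            log(f"{datetime.now():%d-%b-%y %H:%M:%S} {host}:{port} {state}")
            last = ok
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Running `monitor("your.server.example", 22)` on the client PC alongside ATE would show whether the network path itself drops at the moment a session is lost. Note that it tests reachability with new connections, so it will not catch a problem that only affects long-lived idle connections.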

Re: SSH Disconnects [Re: John Andreasen] #34833 15 Dec 21 03:22 PM
Joined: Aug 2016
Posts: 329
J
John Andreasen Offline OP
Member
Hi Jack,

Thanks for the great information, and sorry for the delayed response.

When this was happening, we had the system tcp_keepalive_time set to 60 and tcp_keepalive_intvl set to 20. This seemed to solve a similar issue at another customer.
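
To put numbers on those settings, here is a small sketch (Linux only) that reads the kernel keepalive values and estimates how long an idle connection to a dead peer survives before the kernel gives up. The function names are made up for illustration. The probe count of 9 used in the example is the usual Linux default for tcp_keepalive_probes, which is an assumption since we did not check it on this server. The kernel values also only apply to sockets that enable SO_KEEPALIVE, which sshd does by default (TCPKeepAlive yes).

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/net/ipv4")  # Linux only


def read_keepalive_settings(base=SYSCTL_DIR):
    """Read the three kernel keepalive settings; None where unavailable."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    out = {}
    for name in names:
        try:
            out[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            out[name] = None
    return out


def detection_seconds(time_s, intvl_s, probes):
    """Approximate idle seconds before the kernel declares the peer dead."""
    # First probe after time_s idle, then `probes` unanswered probes
    # spaced intvl_s apart before the connection is reset.
    return time_s + intvl_s * probes
```

With our values, `detection_seconds(60, 20, 9)` gives 240, so a dead connection would be noticed after roughly four minutes of idle time, and any router or VPN idle timeout longer than 60 seconds should see traffic before it expires.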

In this case, I set the "Keepalive Interval" in the ATE configuration to 60 on the two machines that were experiencing this issue the most. Those users reported that they were no longer getting disconnected after this. Recently, I set the ClientAliveInterval in the sshd_config file to 60, hoping this would solve the issue for all users. I have not yet gone back to the machines I manually configured to remove the ATE keepalive value. So, it's hard to say whether this actually did cure the issue or whether it was some type of coincidence.
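
For checking what a given server actually has configured, here is a rough sketch that pulls ClientAliveInterval and ClientAliveCountMax out of sshd_config text and works out how long sshd tolerates an unresponsive client. It is deliberately simplified: it reads only the global section (stopping at the first Match block), ignores Include files, and assumes the values are plain seconds. The defaults of 0 and 3 are the documented OpenSSH defaults; the function names are hypothetical.

```python
def client_alive_settings(config_text):
    """Return (interval, count_max) from the global part of sshd_config text."""
    # OpenSSH defaults: ClientAliveInterval 0 (disabled), ClientAliveCountMax 3.
    values = {"clientaliveinterval": 0, "clientalivecountmax": 3}
    seen = set()
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.replace("=", " ").split()
        key = parts[0].lower()  # keywords are case-insensitive
        if key == "match":
            break  # only global settings are considered in this sketch
        if key in values and key not in seen and len(parts) > 1:
            values[key] = int(parts[1])  # first occurrence wins in sshd_config
            seen.add(key)
    return values["clientaliveinterval"], values["clientalivecountmax"]


def unresponsive_cutoff(interval, count_max):
    """Seconds of client silence before sshd disconnects (None if disabled)."""
    return interval * count_max if interval > 0 else None
```

With ClientAliveInterval 60 and the default count of 3, `unresponsive_cutoff(60, 3)` is 180: sshd sends a probe through the encrypted channel every 60 seconds of silence and drops the session after three go unanswered.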

Re: SSH Disconnects [Re: John Andreasen] #34835 15 Dec 21 05:29 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
I like both of those approaches. Even if the underlying problem has nothing to do with timing out, setting the keepalives would at least let you discover the problem sooner. And if it is related to inactivity, server-side keepalive in the sshd_config file seems like the best approach (one and done), assuming you have access to it. In cases where you only have access to the client side, setting it in the ATE configuration is the only available option.

Re: SSH Disconnects [Re: John Andreasen] #35616 14 Sep 22 05:22 PM
Joined: Nov 2006
Posts: 2,192
S
Stephen Funkhouser Online Content
Member
Seemingly random SSH disconnects continue to be an issue, and they're difficult to differentiate from a valid disconnect.

Would the tunnel connection type be more robust against this type of disconnect?


Stephen Funkhouser
Diversified Data Solutions
Re: SSH Disconnects [Re: John Andreasen] #35618 14 Sep 22 05:28 PM
Joined: Jun 2001
Posts: 11,645
J
Jack McGregor Online Content
Member
Hard to say. It might well be that the SSH forwarder is more tolerant of network flakiness, since it isn't directly attached to a terminal session. On the other hand, it's hard to see how an actual break in the connection could fail to get reported to the server-side process as a SIGHUP if the forwarding is to work properly. Another possibility though is that the forwarders may keep up steady chatter between themselves, keeping the connection open better?


Moderated by  Jack McGregor, Ty Griffin 
