Getting random SIGHUPs #11498 27 Nov 18 08:38 AM
Joined: Jan 2003
Posts: 128
Dominic - Madics Systems Ltd Offline OP
Member
This is probably not an A-Shell issue, but I'm posting just in case anyone has seen the same thing before.

This is happening on two remote sites of a customer but not on the local site. It started happening two days ago. Normal network diagnostics are not getting anywhere (show no problems).

I was wondering if there could be a false positive for the SIGHUP: a SIGHUP that is not really a SIGHUP, or similar.


27-Nov-18 14:13:55 [p9175066-j0]<> jcbrebuild #0
27-Nov-18 14:16:03 [p9633834-j0]<> jcbrebuild #0
27-Nov-18 14:19:06 [tsk:10551428-j0]<> jcbrebuild #0
27-Nov-18 14:21:58 [tsk:7602254-j0]<> jcbrebuild #0
27-Nov-18 14:23:58 [tsk:7405796-j0]<> jcbrebuild #0
27-Nov-18 14:26:30 [p7798786-j3] SIGHUP trapped on: TSKAAC (yjane)
27-Nov-18 14:26:30 [p7798786-j3] (Waiting for kbd wait to generate e
27-Nov-18 14:26:30 [p7798786-j3] (Now in kbd wait; setting basic erro
27-Nov-18 14:26:31 [p7798786-j3] jcbrebuild #3
27-Nov-18 14:26:31 [p7798786-j3] 27-Nov-18 14:26:31 [TSKAAC,3,MADICX,y
Was: 20P/21L, Is: 20P/21L (free passes: 0 tty, 1 ip)

Re: Getting random SIGHUPs #11499 27 Nov 18 10:43 AM
Joined: Sep 2002
Posts: 5,450
Frank Online Content
Member
Maybe Steve is creating some gremlins ;)

Re: Getting random SIGHUPs #11500 27 Nov 18 11:07 AM
Joined: Jun 2001
Posts: 11,645
Jack McGregor Online Content
Member
Hmmm... I'm not sure what to make of this but here are a few thoughts...

1. I've never heard of a false positive SIGHUP coming from some OS-level glitch. But it is certainly possible that an A-Shell or OS-level process is explicitly sending SIGHUP signals via the UNIX kill command, so I guess a "rogue agent" is impossible to rule out. (Note that at one point in the distant past, KILL.LIT was even guilty of sending SIGHUP to terminate jobs; that was eventually changed to send SIGKILL instead. Also note that when A-Shell sends a signal to another job via MX_KILL, a message "MX_KILL signal ## sent to pid #####" will be written to the ashlog file.)

2. I have seen cases of cron scripts that crank up every so many minutes, ostensibly looking for CPU hogs or zombies and end up sending out a barrage of SIGHUP or SIGKILL signals. The telltale sign of that is usually a bunch of SIGHUP messages in the log at the same time, repeating at a fixed interval.

3. With that in mind, it might be useful to see a somewhat larger excerpt of the log, both to check for such clusters, and also to get a sense of how well the jobs receiving the SIGHUPs are shutting themselves down. (I know that's not the issue here, but it often is, so it's a good opportunity to review.) Typically what I'll do is locate the SIGHUP message, then search from there for the pid (in the bracketed part of each trace prefix) to see if the job goes through the entire sequence, ending up with "After qpurge & qclose", and how long that takes.

4. It's hard to tell from the excerpt if the SIGHUP is occurring after the job had been running for a while, or if it was during the startup. (The "Was: 20P/21L, Is: 20P/21L" in the final trace suggests that there was no change in the physical/logical job count as a result of the SIGHUP, which might mean that the job never got into the job table to start with, or that the "jcb rebuild" (more like a "rescan") occurred before the job had exited.)

5. What version/platform is this?

6. I once wrote a utility to convert an ashlog.log to a kind of spreadsheet of sessions to allow for an overview, including information on how many sessions terminate with errors or signals. But it's kind of a work in progress due to the constantly evolving information in the ashlog. (This is one of the ideas under the category of "system health reports" that we touched on at the Conference but didn't really resolve.) Usually the issue comes up in cases like this, where you are interested in a specific statistic: the number or frequency of SIGHUPs in this case. And that should be fairly easy to get by scanning your ashlog for indicators of session start, finish, and termination, such as:

Normal (but rather short) session...
Code
27-Nov-18 05:20:23 [p21317-23]<:(nil)> In: Nodes=11/31/55 [P], ip=192.168.20.205 d8:9e:f3:6:9a:43, (dave)
...
27-Nov-18 05:20:36 [p21317-23]<HOST:0x3da> Out: Nodes Remaining = 10P/30L, 15 reads, 1 writes, 140 kbd byte
SIGHUP with normal recovery/termination...
Code
27-Nov-18 04:55:38 [tsk:20349-18]<BXINA2:0x436a> SIGHUP trapped on: TSKAAR (steak)
...
27-Nov-18 04:55:43 [tsk:20349-18]<MASTMU:0x47c6>  After qpurge & qclose
But if you are sufficiently motivated to want to tinker with a spreadsheet treatment of the ashlog in order to gather statistics, look for anomalies, etc., that might inspire me to dig out the routine to let you play with it. (Full disclosure/warning: it's the kind of thing that can easily suck up many hours of gathering, analyzing, refining, etc., which might be interesting but aren't necessarily that productive.)
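
For anyone who wants a quick count without a full spreadsheet treatment, the scan described in items 2, 3 and 6 can be roughed out in a few lines of Python. This is only a sketch and is not the utility mentioned above. The line pattern is inferred solely from the excerpts in this thread, so a real ashlog may need adjustments. The names `parse_line`, `sighup_recoveries` and `sighups_per_minute` are made up for illustration, and the month-name parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the excerpts in this thread (an assumption):
#   DD-Mon-YY HH:MM:SS [p<pid>-<job>] or [tsk:<pid>-<job>], optional <...>, message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?:p|tsk:)(?P<pid>\d+)-[^\]]*\]"
    r"(?:<[^>]*>)?\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, pid, message), or None for lines that don't match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%d-%b-%y %H:%M:%S")
    return ts, m.group("pid"), m.group("msg")


def sighup_recoveries(lines):
    """For each trapped SIGHUP, report seconds until 'After qpurge & qclose'.

    Returns (pid, sighup_time, seconds) tuples; seconds is None when the
    closing message never shows up for that pid.
    """
    pending = {}   # pid -> time the SIGHUP was trapped
    results = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, pid, msg = parsed
        if msg.startswith("SIGHUP trapped"):
            pending[pid] = ts
        elif "After qpurge & qclose" in msg and pid in pending:
            start = pending.pop(pid)
            results.append((pid, start, (ts - start).total_seconds()))
    # Anything still pending never reached the end of the shutdown sequence.
    results.extend((pid, ts, None) for pid, ts in pending.items())
    return results


def sighups_per_minute(lines):
    """Count trapped SIGHUPs per minute, to make clusters easy to spot."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2].startswith("SIGHUP trapped"):
            counts[parsed[0].replace(second=0)] += 1
    return counts
```

Feeding it `open("ashlog.log", errors="replace")` should be enough for a first look. Several SIGHUPs landing in the same minute, recurring at a regular interval, would point toward the cron-script scenario in item 2, while a `None` in the recovery list flags a job that never completed its shutdown sequence.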

Re: Getting random SIGHUPs #11501 03 Dec 18 05:46 AM
Joined: Sep 2003
Posts: 4,135
Steve - Caliq Offline
Member
After Dominic had been running around in circles for a few days, the customer's I.T. person decided to upgrade the firmware on their router, and this magically fixed it.
They said "upgrade", but we do wonder if it was really an unspoken "downgrade".

Re: Getting random SIGHUPs [Re: Dominic - Madics Systems Ltd] #34180 28 Apr 21 03:23 PM
Joined: Nov 2006
Posts: 2,192
Stephen Funkhouser Online Content
Member
Curious if the utility you mention here has gotten any unreported attention. Having users report disconnects, and then having to manually parse the ashlog to determine which disconnects should be treated as ordinary termination vs. abnormal, is quite difficult. Not to mention time-consuming.
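
To make the "ordinary vs. abnormal" question concrete, here is a minimal Python sketch of the kind of classification meant. It assumes the markers shown in the excerpts earlier in this thread ("In:", "Out:", "SIGHUP trapped", "After qpurge & qclose"), keys sessions by pid (so pid reuse within one log would confuse it), and uses a hypothetical function name and hypothetical labels. Whether a SIGHUP followed by a clean shutdown counts as ordinary is a judgment call left to the reader.

```python
import re

# Prefix shape inferred from the excerpts in this thread (an assumption):
#   <date> <time> [p<pid>-<job>] or [tsk:<pid>-<job>], optional <...>, message
PREFIX_RE = re.compile(
    r"^\S+ \S+ \[(?:p|tsk:)(\d+)-[^\]]*\](?:<[^>]*>)?\s*(.*)$"
)


def classify_sessions(lines):
    """Map each pid to a rough (hypothetical) termination label.

    'open'              : session start seen, no end seen yet
    'normal'            : ended with an 'Out:' line and no SIGHUP
    'sighup-clean'      : SIGHUP trapped, then 'After qpurge & qclose'
    'sighup-unfinished' : SIGHUP trapped, shutdown sequence never completed
    """
    status = {}
    for line in lines:
        m = PREFIX_RE.match(line.strip())
        if not m:
            continue  # continuation lines and other formats are skipped
        pid, msg = m.groups()
        if msg.startswith("In:"):
            status[pid] = "open"
        elif msg.startswith("SIGHUP trapped"):
            status[pid] = "sighup-unfinished"
        elif "After qpurge & qclose" in msg and status.get(pid) == "sighup-unfinished":
            status[pid] = "sighup-clean"
        elif msg.startswith("Out:") and status.get(pid, "open") == "open":
            status[pid] = "normal"
    return status
```

Counting the labels with `collections.Counter(classify_sessions(lines).values())` gives a one-line summary per log. The "sighup-unfinished" entries are the ones most worth a manual look.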


Stephen Funkhouser
Diversified Data Solutions
Re: Getting random SIGHUPs [Re: Dominic - Madics Systems Ltd] #34181 28 Apr 21 05:21 PM
Joined: Jun 2001
Posts: 11,645
Jack McGregor Online Content
Member
No unreported attention so far, but I'll take this as an indication of interest and will put it on the to-do list to see whether it's practical to turn it into something usable outside of the lab.

Re: Getting random SIGHUPs [Re: Dominic - Madics Systems Ltd] #34182 28 Apr 21 06:04 PM
Joined: Nov 2006
Posts: 2,192
Stephen Funkhouser Online Content
Member
Thanks

Last edited by Stephen Funkhouser; 28 Apr 21 06:04 PM.

Stephen Funkhouser
Diversified Data Solutions
