ASHLOG: Odd JCB message #33644 11 Nov 20 06:51 PM
Frank (OP)
Member
Joined: Sep 2002
Posts: 5,450
Hey Cap -

How are supplies holding out in the bunker? I think it's safe to venture out... but I would stay in after dark... ;)

I was doing some triage and noticed these odd messages in the ashlog (display courtesy of our watchdog program "gstat"). I'm not sure what to make of them, whether a reset is in order, or whether to just chalk it up to a sign of the times...

TIA

Attached Files Capture.PNG
Re: ASHLOG: Odd JCB message [Re: Frank] #33645 11 Nov 20 07:23 PM
Jack McGregor
Member
Joined: Jun 2001
Posts: 11,645
TP: check
Wine: getting down to the swill, except for the Champagne waiting for the appropriate moment.
Whiskey: long gone
But it's quiet in the neighborhood. I even got out for a hike last night.

As for the odd messages - fortunately they're less odd than they look. Normally "JCB Verify Errors" would be serious, but only when the process number differs between the "expected" and the "was" values, which would indicate that two independent processes are somehow trying to share the same job. In this case, the only difference between the expected value and the actual value is the ? in the first character, which is the indicator that the job has received a SIGHUP. The signal handler plugs the ? in as a marker, indicating that the tty part of the identifier is no longer valid but the pid may still be running in the background (until it stops for input or hits another error). In other words, what you're seeing is the result of unexpected disconnects followed by some program operation that triggers a job table read or update. I think this particular message has been removed in later versions (to avoid raising the concern you're raising now).
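
To make the distinction concrete, here is a minimal Python sketch of the check described above. It is not A-Shell code: the function name and the sample identifier strings are made up for illustration, and the only rule it relies on is the one stated above (a leading ? as the sole difference means a SIGHUP marker; any other difference deserves a closer look).

```python
def classify_jcb_mismatch(expected: str, actual: str) -> str:
    """Classify an "expected" vs. "was" pair from a JCB verify message.

    Hypothetical helper for illustration; the identifier layout is NOT
    taken from A-Shell documentation.
    """
    if expected == actual:
        return "match"
    # Benign case: the values are identical except that the first
    # character of the actual value has been replaced by '?'.
    if (
        len(expected) == len(actual)
        and actual.startswith("?")
        and expected[1:] == actual[1:]
    ):
        return "sighup-marker"
    # Anything else (for example a different process number) is the
    # case worth investigating.
    return "investigate"


# Made-up sample identifiers, only to exercise the three outcomes.
print(classify_jcb_mismatch("T001234", "?001234"))  # sighup-marker
print(classify_jcb_mismatch("T001234", "T009999"))  # investigate
print(classify_jcb_mismatch("T001234", "T001234"))  # match
```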

The only real concerns that might be worth additional investigation are:

1) Why are there so many disconnects? Users don't know how to disconnect properly? Network problems? (Note that you don't need the message above to look for disconnects - instead just look for "SIGHUP", assuming you have TRACE=SIGHUP set, which I always recommend.)

2) Are the programs responding in a sensible way to the disconnect? (To answer this question, look at the raw log for the SIGHUP messages, then search forward from there matching the pid to see what else the jobs are doing before they eventually disconnect.)
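
A rough sketch of the search described in point 2, in Python. The log layout here is an assumption, not the real ashlog format: each line is assumed to carry the process id as `pid=<digits>`, and `PID_RE` and `activity_after_sighup` are hypothetical names. Adjust the pattern to whatever your raw log actually looks like.

```python
import re
from typing import Dict, Iterable, List

# ASSUMPTION: the pid appears in each line as "pid=<digits>".
# This is an illustrative layout, not the actual ashlog format.
PID_RE = re.compile(r"\bpid=(\d+)\b")


def activity_after_sighup(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect, per pid, the log lines that follow its first SIGHUP line."""
    followups: Dict[str, List[str]] = {}
    for line in lines:
        match = PID_RE.search(line)
        if match is None:
            continue  # no pid on this line, nothing to attribute it to
        pid = match.group(1)
        if pid in followups:
            # This pid already received a SIGHUP: record what it does next.
            followups[pid].append(line)
        elif "SIGHUP" in line:
            # First SIGHUP seen for this pid: start tracking it.
            followups[pid] = []
    return followups


sample = [
    "10:01:02 pid=111 SIGHUP",
    "10:01:03 pid=222 program start",
    "10:01:05 pid=111 closing files",
    "10:02:00 pid=222 SIGHUP",
    "10:02:10 pid=111 exit",
]
for pid, later in activity_after_sighup(sample).items():
    print(pid, later)
```

The number of keys in the result also gives a quick count of how many distinct jobs were disconnected, which speaks to point 1.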

Re: ASHLOG: Odd JCB message [Re: Frank] #33646 11 Nov 20 07:57 PM
Frank (OP)
Thanks for the reply.

I am glad you have your supplies list in the appropriate order ;) Indeed, it has been warmer than usual here, and there are some late autumn trees making it very nice to get out as well.

Interesting analysis. It does seem like some sort of catastrophic disconnect occurred on this server around 10-11am today. Nobody reported it directly, so I am asking around. Yes, we do trace for SIGHUPs; below is a sampling of what I see. And yes, we attempt to do housekeeping: if the program detects an error >= 250, it tries to log it and sign out the user.

Attached Files Capture.PNG
Re: ASHLOG: Odd JCB message [Re: Frank] #33647 11 Nov 20 08:06 PM
Jack McGregor
The timestamps for your excerpts of SIGHUPs above range over 3 hours, so while it could certainly be network flakiness, I'm not sure what the evidence is for a catastrophic disconnect. But it does sound like you've got a good handle on it. And BTW, that's a very nice looking log-file display utility!

Re: ASHLOG: Odd JCB message [Re: Frank] #33648 11 Nov 20 08:51 PM
Frank (OP)
Thanks Cap! This was sort of my clean-out-the-garage project. It got to the point where it was really hard to manage some of our larger cloud servers. This allows me to manage ashell, linux and ashlog info in a way that lets me sort and filter results meaningfully; I can filter and/or sort on most columns. It also helps to find runaway jobs or processes, zombies, etc. I have tied the jobtbl data back to the database so support can see who is doing what and where, as well as what locks or processes are open. I can then take the PID and trace it back to the server and ashlog events.

Above I sorted the comments area so you can see (by the ashlog row number) that these were all coming in at different times over this morning, I guess. I used to have a filter on this column but it took too long to process (max of 10K display rows). What I still need to add is the ability to go back to the previous 10K rows, and even earlier instances of ashlog, but for now it's the most current 10K rows (which, believe it or not, can go in the blink of an eye!). I can, however, filter on PID and/or program, which allows me to track a single user's events.

If I find a trigger for the above messages I will let you know... now it seems you might need a trip back to the liquor store... ;)

Last edited by Frank; 11 Nov 20 09:04 PM.
