From: jparker@hpbs3645.boi.hp.com (Jack Parker) Newsgroups: comp.databases.informix Subject: DBAmon - Engine monitor Date: 1 Apr 1994 15:03:43 -0500 Folks, Some time ago I wrote a monitor to track engine performance and tell the DBA of problems. Mike Alder has since made several enhancements to it. This is an update of the one I posted previously (probably about a year ago). I asked for and received permission to make this public as payment for all the help I've received from the net. Please note it makes extensive use of awk which may not work the same way on other platforms. Also note that if it blows up your data - that's your problem not mine. No warranties expressed or implied. As always - any comments or suggested improvements would be welcome. Synopsis: DBAmon - Database engine monitor Rating: B+ Author: JParker/Mike Alder Tested under: HP - Online 5.0 Pro: has saved our bacon a number of times. Con: is a resource hog, it dumps a memory image (later cleaned up) which can be rather large. DBAmon reads a file (online.engines) to determine what engines are running on a platform and then performs a tbstat against each engine and performs a number of checks to ensure: a) that the engine is running b) that the IStar deamon is running (configurable) c) checks installation permissions (uses ckfls) d) check performance against user-defined thresholds e) check logfiles for errors f) check chunk status != PD or MD g) check log files backup (only applicable when performed to disk) h) check for resource hogs or DOA users i) is a tbtape -c running (configurable) j) check for free system memory k) check for disabled page cleaners l) check for free dbspace m) check disk fragmentation n) check for logfile errors o) check Online reserved pages Results are mailed to the $DBA. --- cut here -------------------------------------------------------------- # This is a shell archive. Remove anything before this line, # then unpack it by saving it in a file and typing "sh file". # # Wrapped by /Informix on Fri Mar 4 18:05:39 1994 # # This archive contains: # DBAmon online.engines ckfls # LANG=""; export LANG PATH=/bin:/usr/bin:$PATH; export PATH echo x - DBAmon cat >DBAmon <<'@EOF' #!/bin/ksh ############################################################################ # # Shell to monitor DB Engine performance. # # Jack Parker, HP Boise Surface Mount Center 12/92 # # Appended,Modified and Maint - Contact Mike Alder Ext. 5245 # # Maintained some more JParker 2/26/94 # (learned a bit since '92) - shortened somewhat, but didn't go # overboard. # ############################################################################ ENGINE_FILE="/users/informix/online.engines" # see enclosed sample online.engines. PAGE_SIZE=4098 # this can be found using monitor->Virtual mem # lower right hand corner OUT_FILE=/tmp/DBAmon.tmp # where to write core image of engine # (can be big) DBA=address@location # who to tell of problems alert_cmd=mailx # how to tell them. ############################################################################ SYSTEM=`uname -n` # Monitor thresholds - tell DBA after this level is passed. # note these are PER DAY figures. ############################################################################ set_var() { o_buff_th=50 # Over buffer o_user_th=10 # Over users o_lock_th=10 # Over locks o_tbls_th=10 # Over Tables (too many requests for tblspace) o_buf_wt_th=20 # buffer waits o_lck_wt_th=10 # lock waits o_doa_lck_th=10 # Deadlocks o_dltout_th=10 # Deadlock timeouts o_lchwait_th=20 # latch waits o_ckp_wt_th=20 # checkpoint waits # note these are NOT per day figures. ########################################################################### rcache_th=95 # read cache % wcache_th=82 # write cache % pcnt_th=20 # disk full at 20 percent free lgf_th=66 # log files are full - 66 % b_m_th=80 # buffers are too small b_l_th=70 # buffers are too big lg_full_th=75 # Logical log device is full mfr_th=10000 # how much memory should be free in "kbytes" lst_arch_th=1 # how many days ago should archive have been? } ############################################################################ # tbtape -c running? # doesn't always give right numbers # but only prints when there is a problem # we don't use tbtape -c, so little opporunity # to test. chk_tbtape() { no_tps=`grep -v "^#" $ENGINE_FILE |cut -d":" -f 4|grep -c y` norun=`ps -e | grep -c "tbtape -c"` if [ -z "$norun" ] then norun=0 fi if [ -z "$no_tps" ] then no_tps=0 fi let no_tps="$no_tps * 2" if [ $no_tps -ne $norun ] then echo "\nOnly $norun of the $no_tps continuous backups are running!" fi } ############################################################################### # get list of engines from system file # format is: (user maintained) #Engine_Name:config_file:server_no:tbtape-c?:arch_level:grps get_engines() { engines=`grep -v "^#" $ENGINE_FILE |cut -d":" -f 1` } ############################################################################### # Check to see if engine is on line. # how many days has engine been up? # reset thresholds for that number # watch out for bogus values... set_thrs() { up=`tbstat -o /dev/null | grep -i "on-line" | awk '{print $8}'` if [ -z "$up" ] then echo " $engine is OFF line - no further stats will be taken" RETURN_FLAG=1 return fi dys_up=`echo $up | grep -c ":"` if [ -z "$dys_up" ] then echo variable dys_up was not set properly return fi if [ $dys_up -ne 0 ] then up=1 fi if [ $up = "--" ] then up=1 fi let buff_th=o_buff_th*$up # Over buffer let user_th=o_user_th*$up # Over users let lock_th=o_lock_th*$up # Over locks let tbls_th=o_tbls_th*$up # Over Tables (too many rqst for tblspc) let buf_wt_th=o_buf_wt_th*$up # buffer waits let lck_wt_th=o_lck_wt_th*$up # lock waits let doa_lck_th=o_doa_lck_th*$up # Deadlocks let dltout_th=o_dltout_th*$up # Deadlock timeouts let lchwait_th=o_lchwait_th*$up # latch waits let ckp_wt_th=o_ckp_wt_th*$up # checkpoint waits } ############################################################################### # check permissions on distribution files - only run on Sundays ############################################################################### chk_perm() { # Verify correct installation permissions on all files at low peak ck_day=`date +%a` if [ -z "$ck_day" ] then echo "ck_day is not set " return fi if [ $ck_day = "Sun" ] then /users/informix/bin/ckfls onlinefiles >> /tmp/DBA.warning /users/informix/bin/ckfls starfiles >> /tmp/DBA.warning /users/informix/bin/ckfls sqlfiles >> /tmp/DBA.warning fi } ############################################################################### # check that Page Cleaners aren't disabled chk_pge_clean() { tbstat -F | sed 1,7d | sed '/^$/d' | grep -v "Chunk" | awk '{print $4}' | while read n do if [ $n = E ] then echo "Warning Page Cleaners have been Disabled \ check the tbstat -F command" fi done } ############################################################################### # check system memory chk_sys_mem() { mfree=`vmstat | awk '{if (NR==3){print $5}}'` let mfree=$mfree*$PAGE_SIZE/1024 if [ $mfree -lt $mfr_th ] then echo Only $mfree"k" Memory Left!! fi mfree="" } ############################################################################### # Disk usage (alert when dbspace full) chk_dbsp_fr() { tbstat -d $OUT_FILE | awk ' BEGIN { p_flg = 0 } /Chunks/ {p_flg = 4} /active,/ {p_flg = 0} { if (p_flg > 1) p_flg-- if (p_flg == 1 && $7 != "MO-" ) { p_free = $6/$5*100 if (p_free < pcnt_th) printf("Warning: Chunk %d of disk %s is %d percent free\n", $2, $8, p_free); } }' pcnt_th=$pcnt_th p_free="" } ############################################################################### # Performance (wcache, rcache) # over max users, locks, tblspaces, buffers chk_perf() { tbstat -p $OUT_FILE | awk ' { if (NR==6) { if ($4 < rchc_th) printf("Warning: Read cache hits are at %s\n", $4); if ($8 < wchc_th) printf("Warning: Write cache hits are at %s\n", $8); } if (NR==12) { if ($1 > tbls_th) printf("Warning: Over Table Spaces %s\n", $1); if ($2 > lock_th) printf("Warning: Over Locks %s\n", $2); if ($3 > user_th) printf("Warning: Over Users %s\n", $3); if ($4 > buff_th) printf("Warning: Over Buffers %s\n", $4); } if (NR==15) { if ($1 > buf_wt_th ) printf("Warning: Buffer Waits - %s\n", $1); if ($2 > lck_wt_th ) printf("Warning: Lock Waits - %s\n", $2); if ($4 > doa_lck_th) printf("Warning: Deadlocks Waits - %s\n", $4); if ($5 > dltout_th ) printf("Warning: Deadlock Timeouts - %s\n", $5); if ($6 > lchwait_th) printf("Warning: Latch Waits - %s\n", $6); if ($7 > ckp_wt_th ) printf("Warning: Checkpoint Waits - %s\n", $7); } } ' wchc_th=$wcache_th rchc_th=$rcache_th \ buff_th=$buff_th user_th=$user_th lock_th=$lock_th tbls_th=$tbls_th \ buf_wt_th=$buf_wt_th lck_wt_th=$lck_wt_th doa_lck_th=$doa_lck_th \ dltout_th=$dltout_th lchwait_th=$lchwait_th ckp_wt_th=$ckp_wt_th tbls_th="" lock_th="" buff_th="" buf_wt_th="" lck_wt_th="" doa_wt_th="" dltout_th="" lchwait_th="" ckp_wt_th="" } ############################################################################### # Logical log files (alert when 2/3 are full) # logical logs buffer size, alert when wrong size # physical logs buffer size, "" # Don't know many rows there will be so we can't use # NR (in the awk), but we DO know how soon after # each header this stuff will START - hence the # strange flg_stuff. chk_logfls() { tbstat -l $OUT_FILE | awk ' BEGIN { l_flg = 0 pb_flg=0 lb_flg=0 lgs=0 lgf=0 } /Physical/ {pb_flg = 4} /Logical/ { lb_flg = 4 } /address/ { l_flg = 3 } { if (l_flg > 1) l_flg-- if (l_flg == 1) { lgs=lgs+$6 lgf=lgf+$7 } if (pb_flg > 1) pb_flg-- if (pb_flg == 1) { pcnt_fl=$6/$3 pcnt_tim=pcnt_fl*100 if (pcnt_tim > b_m_th) printf ("Warning: Physical Log Buffers too small -- %s utilized\n",pcnt_fl*100); if (pcnt_tim < b_l_th) printf ("Warning: Physical Log Buffer too large -- %s utilized\n",pcnt_fl*100); pb_flg=0 } if (lb_flg > 1) lb_flg-- if (lb_flg == 1) { lcnt_fl=$8/$3 lcnt_tim=lcnt_fl*100 if (lcnt_tim > b_m_th) printf ("Warning: Logical Log Buffer too small -- %s utilized\n",lcnt_tim); if (lcnt_tim < b_l_th) printf ("Warning: Logical Log Buffer too large -- %s utilized\n",lcnt_tim); lb_flg=0 } } END { if ((lgf/lgs*100) > pcnt_th) printf("Warning: Log files are %s percent full\n", lgf/lgs*100); }' pcnt_th=$lgf_th b_m_th=$b_m_th b_l_th=$b_l_th } ############################################################################### chk_err() #check to see if errors in the logfile ############################################################################### { #This command tails the online.log file and sees if there is an error. #or checks for latest config changes MSG_FILE=`tbstat -c | grep MSGPATH | awk '{print $2}'` tail -150 $MSG_FILE | grep -i -e error -e param -e abort -e warning } ############################################################################### chk_star() #check to see if star is running ############################################################################### { #Initialize my variables starname="" found_tmp="" found="" # Verify there should be a star running star_run=`grep $engine $ENGINE_FILE | cut -d":" -f9` if [ $star_run = "y" ] then #get the star name from the engine file starname=`grep $engine $ENGINE_FILE | cut -d":" -f11` if [ -z "$starname" ] then echo "star search did not recieve a variable from $ENGINE_FILE" return fi found_tmp=`ps -ef | grep $starname` if [ -z "$found_tmp" ] then echo "Warning star $found $starname is not currently running!" fi fi } ################################################################################ chk_res() # check online reserved pages ################################################################################ { tbcheck -cr | grep -i -e warning -e error } ############################################################################### chk_chnk_st() # check chunk status must not be "PD" or "MD" ############################################################################### { #CHECK to see if Primary Chunk is DOWN tbstat -d | grep PD | awk '{print "CAUTION " $8 " CHUNK IS DOWN!!"}' #CHECK to see if Mirror Chunk is DOWN tbstat -d | awk '{print $7}' | grep MD | awk '{print "CAUTION " $8 " MIRROR CHUNK IS DOWN!!"}' } ############################################################################### chk_lg_bckp() # check logfile backup (disk only) ############################################################################### { LTAPE=`grep LTAPEDEV $INFORMIXDIR/etc/$TBCONFIG | awk '{print $2}'` if [ -f $LTAPE ] then LTSZ=`grep LTAPESIZE $INFORMIXDIR/etc/$TBCONFIG | awk '{print $2}'` LTF=`ls -s $LTAPE| awk '{print $1}'` let LT_F=$LTF*512 let thrsh=$lg_full_th*$LTSZ*10000 if [ $LT_F -gt $thrsh ] then echo "Logfiles are currently $LT_F full!" fi else if [ $LTAPE != "/dev/null" ] then echo "Cannot stat logfile $LTAPE!" fi fi } ############################################################################### # check date of last archive chk_lst_arc() { last_arch=`tbcheck -pr | grep "Real Time Archive" | awk '{print substr($5,4,2)}'` if [ -z "$last_arch" ] then echo " ARCHIVE last_arch variable in chk_lst_arc has not been set." echo " No ARCHIVE status will be done." return fi let yest="`date +%d` - $lst_arch_th" if [ $yest -gt 0 ] # don't worry at month end, date arithmetc not worth it. then if [ $yest -ne $last_arch ] then echo "Last archive was the "$last_arch"th" fi fi last_arch="" } ############################################################################### chk_rsrc() { # Resource hogs or DOA # Note: this is a moment in time, could be coincidence # that someone has a resource just now - not necessarily # a hog # prints buffers which are waited upon # $9 = user holding, $10 = user waiting tbstat -b $OUT_FILE | awk ' BEGIN { p_flg = 0 } /address/ {p_flg = 2} /modified/ {p_flg = 0} { if (p_flg == 1) { print $0 } if (p_flg > 1) { p_flg-- } }' | sort -b -k8 > /tmp/onl.buffers # prints locks which are waited upon # $2 = user waiting, $3 = user holding tbstat -k $OUT_FILE | awk ' BEGIN { p_flg = 0 } /address/ {p_flg = 2} /active/ {p_flg = 0} { if (p_flg == 1 && $2 != 0) { print $0 } if (p_flg > 1) { p_flg-- } } ' | sort -b -k2 > /tmp/onl.locks # prints latches which are waited upon # $5 = holder of latch tbstat -s $OUT_FILE | awk ' BEGIN { p_flg = 0 } /name/ {p_flg = 2} { if (p_flg == 1 && $0 != " " && NF > 0) { print $0 } if (p_flg > 1) { p_flg-- } }' | sort -b -k4 > /tmp/onl.latches # files exist, must have something if [ -s /tmp/onl.buffers -o -s /tmp/onl.locks -o -s /tmp/onl.latches ] then # who is the culprit? # $1 = user address tbstat -u $OUT_FILE | awk ' BEGIN { p_flg = 0 t_flg = 0 } /address/ {p_flg = 2} /active/ {p_flg = 0 t_flg = 1} { if (p_flg == 1 && t_flg == 0) { print $0 } if (p_flg > 1) { p_flg-- } } ' | sort > /tmp/onl.usrs echo "User is holding resources!" echo "Buffers:" join -o 1.4 1.5 1.3 1.1 1.2 1.8 1.9 1.10 2.5 2.6 2.7 2.8 2.10 -1 1 -2 9 \ /tmp/onl.usrs /tmp/onl.buffers echo "Latches:" join -o 1.4 1.5 1.3 1.1 1.2 1.8 1.9 1.10 2.1 2.2 -1 1 -2 5 \ /tmp/onl.usrs /tmp/onl.latches echo "Locks:" join -o 1.4 1.5 1.3 1.1 1.2 1.8 1.9 1.10 2.1 2.2 2.5 -1 1 -2 3 \ /tmp/onl.usrs /tmp/onl.locks fi rm -f /tmp/onl.* } ############################################################################### # Main ############################################################################### #because tbstat -o dumps so much data make sure that it is set before running. if [ -z "$OUT_FILE" ] then echo "Variable OUT_FILE has not been set." echo "Please define variable and restart" echo "Variable OUT_File not set" | $alert_cmd $DBA exit fi rm -f /tmp/DBA.warning # shoot old message file get_engines # Get list of engines #check installation permissions chk_perm # for each engine for engine in $engines do RETURN_FLAG=0 TBCONFIG=`grep $engine $ENGINE_FILE | cut -d":" -f 2` export TBCONFIG # save shared mem segment so all tests run against # same instant echo "\nEngine: " $engine tbstat -o $OUT_FILE set_var if [ -z "$o_buff_th" ] then continue fi set_thrs # set thresholds to reflect # days up # verify that engine is running(on-line) if [ $RETURN_FLAG = "1" ] then continue fi chk_perf # check performance chk_logfls # check logfiles chk_chnk_st # check chunk status must not be "PD" or "MD" chk_lg_bckp # check logfiles backup (only applicable # when performed to disk) chk_rsrc # check resource hogs or DOA # can really chew up resources, might not be # worth it. chk_tbtape # is tbtape -c running? chk_sys_mem # check system memory chk_pge_clean # Check disabeled page cleaners chk_dbsp_fr # check free dbspace chk_lst_arc # last archive not whenever tbcheck -ce | grep -e WARNING -e Error # disk fragmentation chk_err # check the logfile for errors chk_star # see if star is running chk_res # check online reserved pages done >> /tmp/DBA.warning if [ -f /tmp/DBA.last_warn ] then # status change since last warning? a=`wc /tmp/DBA.warning | awk '{print $1}'` b=`wc /tmp/DBA.last_warn | awk '{print $1}'` if [ $a != $b ] then $alert_cmd $DBA < /tmp/DBA.warning fi else $alert_cmd -s "DBA Monitor" $DBA < /tmp/DBA.warning fi cp /tmp/DBA.warning /tmp/DBA.last_warn if [ -r "$OUT_FILE" ] then rm $OUT_FILE fi @EOF chmod 755 DBAmon echo x - online.engines cat >online.engines <<'@EOF' #DBSERVERNAME:config_file:server_no:tbtape-c?:arch_level:grps:sqlexec:profile:sqlexecd?:no_of_tapes:starname:starlog Format is: Servername (DBSERVERNAME from $TBCONFIG) $TBCONFIG (without the $INFORMIXDIR/etc please) Server Number (SERVERNUM from $TBCONFIG) tbtape -c? (y|n) should a tbtape -c be running for this engine? arch_level 7 digit number, 1 digit/weekday indicating archive level to perform on that day grps an abbreviation of the group names which have access to this engine (eg: 'ab' for groups ab_user, ab_dev, abmfg...) useful for scripts which set $DBPATH or $TBCONFIG see set_eng. sqlexec (sqlturbo|sqlexec) profile Don't know, probably the name of a profile to use for this engine. sqlexecd (y|n) should an IStar deamon be running for this engine? no_of_tapes 7 digit number again, how many tapes does this level of archive take? starname the name of the star server (from sqlhosts and /etc/services) starlog the name of the IStar logfile eg: server1:tbconfig.srvr1:1:y:0122122:ab:sqlexec::sqlturbo:3211211:star27:star27.log @EOF chmod 644 online.engines echo x - ckfls cat >ckfls <<'@EOF' #!/bin/ksh ############################################################################## # Check the informix file distribution list against actual files to verify # that all have the correct ownerships and permissions. # Syntax is: ckfls $filename # where $filename is the filelist to check (e.g. inlinefiles, netfiles, # sqlfiles) # Jack Parker 4/93 # ############################################################################## ############################################################################## # modify the -rwxr-x--- string into a 4 digit permission figure. ############################################################################## ck_pm() { act_pm=`echo $af | awk '{ b1=substr($1,1,1) b2=substr($1,2,1) b3=substr($1,3,1) b4=substr($1,4,1) b5=substr($1,5,1) b6=substr($1,6,1) b7=substr($1,7,1) b8=substr($1,8,1) b9=substr($1,9,1) b0=substr($1,10,1) op=0 gp=0 xp=0 sp=0 if (b2=="r") {op+=4} if (b3=="w") {op+=2} if (b4=="x") {op+=1} if (b4=="s") {sp+=4 op+=1} if (b4=="S") {sp+=4 op+=1} if (b5=="r") {gp+=4} if (b6=="w") {gp+=2} if (b7=="x") {gp+=1} if (b7=="s") {sp+=2 gp+=1} if (b7=="S") {sp+=2 gp+=1} if (b8=="r") {xp+=4} if (b9=="w") {xp+=2} if (b0=="x") {xp+=1} if (b0=="t") {sp+=1 xp+=1} if (b0=="T") {sp+=1 xp+=1} printf("%1d%1d%1d%1d\n", sp, op, gp, xp) }'` } ############################################################################## # main loop ############################################################################## # lose the comments grep -v "^#" /usr/informix/etc/$1 | while read line do # don't process null lines if [ -n "$line" ] then # get the file name fnm=`echo $line | cut -d" " -f1` # we still will get a line called 'PRODUCT' lose it if [ $fnm != "PRODUCT" ] then # get owners, group, and permission number own=`echo $line | cut -d" " -f2` grp=`echo $line | cut -d" " -f3` mod=`echo $line | cut -d" " -f4` # ll the file af=`ll -d $INFORMIXDIR/$fnm` # develop a 4 digit permission to compare ck_pm # get actual owner and group act_own=`echo $af | cut -d" " -f3` act_grp=`echo $af | cut -d" " -f4` # are they the same? if [ $act_own != $own -o $act_grp != $grp -o $act_pm -ne $mod ] then echo "Warning Check Permissions on the following files" echo "------------------------------------------------" echo "Mismatch - Actual :\n$af\nCorrect :\n$INFORMIXDIR/$line\n" fi fi fi done @EOF chmod 755 ckfls exit 0 --- and here -------------------------------------------------------------- _____________________________________________________________________________ Jack Parker | Hewlett Packard, BSMC Boise, Idaho, USA| "Someday you'll go far, jparker@hpbs3645.boi.hp.com | if you catch the right train." (208) 396-5388 (W) (208) 384-1623 (H) | _____________________________________________________________________________ Any opinions expressed herein are my own and not those of my employers. _____________________________________________________________________________