XRT HOWTO --- Restart HK at ISAS
K. Glotfelty, D. McKenzie, K. Reeves, M. Weber; 2007-Aug-16
If the XRT HouseKeeping (HK) appears to have stopped working, one
possibility is that the HK processes at ISAS have hung. This HOWTO
explains how to restart those processes.
NOTE (added 2008-11-10 by K. Reeves): This procedure only restarts
the HK if the cron jobs are stuck at ISAS. The housekeeping software
has been reconfigured, so that this is much less likely. If the
housekeeping is not updating, it is probably due to the cron jobs at
SAO. Contact xrt_manager@head.cfa.harvard.edu if this appears to be
the case.
Overview
There are cronjobs at ISAS that run xskim and related
software routines in order to collect the HK information from the
telemetry pipeline. These cronjobs run about once per hour. There
is another cronjob at SAO that pulls that info to the SAO XRT HK
website, and that cronjob also runs about once per hour. Therefore,
in order to fix the hung HK jobs, you only have to kill the processes.
The cronjobs will automatically restart everything within an hour.
It may take up to about 2 hours for the new HK data to appear on
the SAO webpage. However, it is also possible that new xskim
and mergebands processes will start immediately after the old ones
are killed, so take care not to kill the new ones.
Procedure
- Log on to the xrtco workstation. Check the directory
/soda/solarb/xrt/status/yyyy/mm/dd.
- If there is recently created
data in this directory, the cron jobs at ISAS are working fine, and
the problem is at SAO. DO NOT follow the steps below - contact
xrt_manager@head.cfa.harvard.edu.
- If there is no recently created data, the cron jobs at ISAS have
stopped; continue with the steps below.
- Log in to the ISAS workstation "norway". You will first have to log in to one of the gateway machines (ssh1, ssh2 or ssh3) as user sbusxrt.
- Look for the mergebands and xskim processes,
and note their process ID numbers:
- The mergebands processes look like:
[sbusxrt@pg2~]$ ps auwx | grep mergebands
sbusxrt 22521 0.0 0.0 2068 956 ? S 00:18 0:00 sh
/home/sbusxrt/install/mergebands.sh 2007/08/09
- The xskim processes look like:
[sbusxrt@pg2~]$ ps auwx | grep xskim
sbusxrt 22734 0.0 0.0 3148 1608 ? S 00:26 0:08
/home/sbusxrt/install/xskim -f 08/09/2007 00:00:00 -t +23:59:59
-l -a sval -b x -v 0 -C mworking
- Kill the xskim job. (Using the ID number from this
example...):
- [sbusxrt@pg2~]$ kill -9 22734
- Wait until any mergebands jobs are done.
- Often the new xskim and mergebands jobs will
begin right away. This is okay. But if it happens, you
should verify that the new xskim and mergebands
processes have new ID numbers.
(Use ps auwx and grep to check ID numbers.)
- Verify that the file ~/mworking is removed.
- You should NOT remove the mworking file until the
old merge jobs are finished. In fact, it should
remove itself automatically when the merge scripts
are done. This file tells the cron job not to start another
job while the current one is still running (the cron job runs
every hour, and sometimes xskim takes longer than that to
run).
- Send email (To: xrt_manager@head.cfa.harvard.edu; Cc:
xrt_co@solar.isas.jaxa.jp) announcing the time that
you killed the HK processes, so that others do not also try to
fix the problem. In the email, also remind people that it may
take up to about 2 hours before the new HK info is transmitted
to the SAO HK website.
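
The two checks in the procedure (is there recently created data in the status directory, and which process IDs do the xskim and mergebands jobs currently have) can be sketched as small shell helpers. This is only an illustration and not part of the installed HK software: the function names has_recent_data and pids_for, the 120-minute default window, and the use of the UT date for the directory name are all assumptions.

```shell
#!/bin/sh
# Hypothetical helpers for the checks described above. The function names
# and the 120-minute default are illustrative assumptions only.

# Succeed (exit status 0) if DIR holds at least one regular file modified
# within the last MINUTES minutes. Uses the -mmin option of find, which
# GNU and BSD find provide but POSIX does not require.
has_recent_data() {
    dir=$1
    minutes=${2:-120}
    [ -d "$dir" ] || return 1
    [ -n "$(find "$dir" -type f -mmin "-$minutes" 2>/dev/null | head -n 1)" ]
}

# Read "ps auwx" output on stdin and print the process ID (second column)
# of every line that contains NAME. NAME is passed through the environment
# rather than on the awk command line, so that the awk process does not
# match itself in the ps listing.
pids_for() {
    NAME="$1" awk 'index($0, ENVIRON["NAME"]) { print $2 }'
}

# Example use (commented out so that sourcing this file has no side effects):
#
#   # On the xrtco workstation; assumes directories are named by UT date.
#   has_recent_data "/soda/solarb/xrt/status/$(date -u +%Y/%m/%d)" \
#       && echo "ISAS cron jobs look fine; contact xrt_manager instead"
#
#   # On the ISAS workstation, before and after the kill:
#   ps auwx | pids_for xskim
#   ps auwx | pids_for mergebands
```

Recording the output of pids_for before the kill and comparing it with the output afterwards is one way to confirm that any xskim and mergebands processes still running are new ones with new ID numbers.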