XRT HOWTO --- Restart HK at ISAS
K. Glotfelty, D. McKenzie, K. Reeves, M. Weber; 2007-Aug-16
If the XRT HouseKeeping (HK) appears to have stopped working, one
possibility is that the HK processes at ISAS have hung. This HOWTO
explains how to restart those processes.
Overview
There are cronjobs at ISAS that run xskim and related
software routines in order to collect the HK information from the
telemetry pipeline. These cronjobs run about once per hour. There
is another cronjob at SAO that pulls that info to the SAO XRT HK
website, and that cronjob also runs about once per hour. Because the
cronjobs restart everything automatically within an hour, fixing hung
HK jobs only requires killing the hung processes. It may take up to
about 2 hours for the new HK data to appear on the SAO webpage.
Note that new xskim and mergebands processes sometimes start
immediately after the old ones are killed; take care not to kill
the new ones.
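One way to avoid killing a freshly started process by mistake is to record the PIDs of the hung processes before killing anything. The following is a minimal sketch, not part of the installed XRT software: the helper name hk_pids and the file name old_xskim_pids.txt are hypothetical, and it assumes awk is available on the workstation and that the PID is the second column, as in "ps auwx" output.

```shell
# hk_pids NAME
# Reads `ps auwx` output on stdin and prints the PID (2nd column) of
# every line that mentions NAME, skipping the grep/awk helper
# processes themselves.
hk_pids() {
    awk -v name="$1" 'index($0, name) && $0 !~ /grep|awk/ { print $2 }'
}

# Hypothetical usage on the workstation:
#   ps auwx | hk_pids xskim > old_xskim_pids.txt    # before killing
#   ps auwx | hk_pids xskim                         # afterwards: compare
```

If a PID printed afterwards does not appear in the saved list, it belongs to a process the cronjob has just restarted and should be left alone.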
Procedure
- Log into ISAS workstation "pg2".
- Look for the mergebands and xskim processes,
and note their process ID numbers:
- The mergebands processes look like:
[sbusxrt@pg2~]$ ps auwx | grep mergebands
sbusxrt 22521 0.0 0.0 2068 956 ? S 00:18 0:00 sh
/home/sbusxrt/install/mergebands.sh 2007/08/09
- The xskim processes look like:
[sbusxrt@pg2~]$ ps auwx | grep xskim
sbusxrt 22734 0.0 0.0 3148 1608 ? S 00:26 0:08
/home/sbusxrt/install/xskim -f 08/09/2007 00:00:00 -t +23:59:59
-l -a sval -b x -v 0 -C mworking
- Kill the xskim job (using the ID number from this
example):
- [sbusxrt@pg2~]$ kill -9 22734
- Wait until any running mergebands jobs are done.
- Often the new xskim and mergebands jobs will
begin right away. This is okay, but if it happens, you
should verify that the new xskim and mergebands
processes have new ID numbers.
(Use ps auwx and grep to check ID numbers.)
- Verify that the file ~/data/status/mworking has been removed.
- You should NOT remove the mworking file until the
old merge jobs are finished. In fact, it should be
removed automatically when the merge scripts
are done. This file tells the cronjob not to start another
job while one is still running (the cron runs every
hour, and sometimes xskim takes longer than that to
run).
- Send email (To: xrt_manager@head.cfa.harvard.edu; Cc:
xrt_co@solar.isas.jaxa.jp) announcing the time that
you killed the HK processes, so that others do not also try to
fix the problem. In the email, also remind people that it may
take up to about 2 hours before the new HK info is transmitted
to the SAO HK website.
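The step of waiting for the mworking file to disappear can be done by polling rather than checking by hand. This is a minimal sketch under stated assumptions: the function name wait_for_removal is hypothetical, and the timeout and polling interval in the usage line are arbitrary illustrative values. It only watches the file and never deletes it, in keeping with the caution above.

```shell
# wait_for_removal FILE TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls until FILE no longer exists. Returns 0 once it is gone,
# or 1 if it is still present after roughly TIMEOUT_SECONDS.
wait_for_removal() {
    wfr_file=$1
    wfr_timeout=$2
    wfr_interval=${3:-60}
    wfr_waited=0
    while [ -e "$wfr_file" ]; do
        if [ "$wfr_waited" -ge "$wfr_timeout" ]; then
            echo "still present after ${wfr_waited}s: $wfr_file" >&2
            return 1
        fi
        sleep "$wfr_interval"
        wfr_waited=$((wfr_waited + wfr_interval))
    done
    echo "removed: $wfr_file"
}

# Hypothetical usage: check once a minute for up to two hours.
#   wait_for_removal "$HOME/data/status/mworking" 7200 60
```

A nonzero return means the file is still there after the timeout, which suggests the old merge jobs have not finished and should be inspected before doing anything else.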