Configuring Watchdog Functionality
The Watchdog is a script (uwsift/util/watchdog.py
) running separately from
SIFT and has the responsibility to assess, whether SIFT running as
monitoring tool (with auto_update.active: True
) is working correctly and to
“bark” otherwise by calling an adaptor script raiseEvent.sh to notify
GEMS. The location of this script has to be configured as notification_cmd
.
The Watchdog does not directly interact with the running SIFT instance but
monitors a file to be configured as heartbeat_file
, which SIFT updates
with the data timestamp (i.e. the start_time
is written into the file) every
time it loads new data. From this information and the filesystem change time of
the heartbeat file the Watchdog can determine, when the monitoring system is not
alive anymore and/or it does not succeed to ingest up to date satellite
data. With the frequency configured by heartbeat_check_interval
the Watchdog
reads the file and compares the time information against the current time and
gives alarm, when the data timestamp stored is older than
max_tolerable_dataset_age
and/or the last time the heartbeat file was
updated is longer ago than the max_tolerable_idle_time
. These three time
span related configurations are in seconds.
To work around the memory leak in SIFT, the watchdog is able to issue a
restart request once the auto_restart_interval
is over. If the user denies
this request by cklicking on the cancel button in the popup window, the
watchdog will send another request every auto_restart_ask_again_interval
seconds. Both configuration options are in seconds and can be disabled with
the value 0.
Furthermore the watchdog is capable of monitoring the memory consumption of the
SIFT application. If the application exceeds the amount specified by
max_memory_consumption
, a restart request is issued. The units M
(Mebibytes) and G
(Gibiabytes) can be used. If this setting is not
given, the watchdog won’t trigger a restart based on excessive memory
consumption.
Note: The units are interpreted with base 1024 to be compatible with analogous configuration options of systemd (see systemd.resource-control: MemoryHigh)
A complete watchdog configuration looks as follows:
watchdog:
heartbeat_file: "$$CACHE_DIR$$/heartbeat.txt"
notification_cmd: /path/to/raiseEvent.sh
heartbeat_check_interval: 30
max_tolerable_dataset_age: 120
max_tolerable_idle_time: 60
auto_restart_interval: 86400
auto_restart_ask_again_interval: 60
max_memory_consumption: 5G
Note the part $$CACHE_DIR$$
of the path for the heartbeat file. When used,
this part is expanded to the default cache directory for the application
according to the XDG standard ($$CACHE_DIR$$
expands to ~/.cache/SIFT
on Linux systems). A normal absolute file path works too.