Re: AO40 emergency software restart routine

"Vince Fiscus, KB7ADL" wrote:

> One would think that command-assist would also be a casualty of the
> lock up and that a locked up system would be unable or unaware that
> it should load and run the software. How do command-assist and other
> recovery routines actually work?

Most likely it is similar to a system we have implemented in an
ordinary (terrestrial) repeater that I help to maintain.  Here is
how our system works (FACTS), and my extrapolation to what I
believe is the answer to your question (SPECULATION):

A hardware timer chip is set to generate an "alarm" after a certain
amount of time passes.

When the computer is working properly, it repeatedly resets the time on
the timer chip to zero, at much shorter intervals than the duration of
the timer setting.  This means that the timer never generates an
"alarm" if the computer is working.

(In the case of the satellite, there may also be a timer that gets
set back to zero every time a command is received, so that even if the
computer is able to reset the timeout timer, an alarm is triggered if
there are no commands within a pre-set, long period of time.)
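
(Purely as an illustration of that two-timer idea, here is a small
Python sketch.  The function name, the tick counts and the cause
labels are all made up for the example; none of it is taken from real
repeater or spacecraft code.)

```python
def first_alarm(timeout_health, timeout_command, kicks, commands, horizon):
    """Return (tick, cause) for the first alarm, or None if none fires.

    kicks    -- set of ticks at which the CPU resets the health timer
    commands -- set of ticks at which a ground command is received
    All names and values here are hypothetical, for illustration only.
    """
    health = 0   # time since the CPU last reset the health timer
    command = 0  # time since the last command was received
    for t in range(1, horizon + 1):
        if t in kicks:
            health = 0
        if t in commands:
            command = 0
        # Both timers advance on their own, independent of the CPU.
        health += 1
        command += 1
        if health >= timeout_health:
            return (t, "cpu-hang")
        if command >= timeout_command:
            return (t, "no-commands")
    return None
```

With a CPU that keeps resetting the health timer every 2 ticks but a
ground station that goes silent after tick 10, the second timer is the
one that eventually fires, which is the whole point of having it.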

If the computer fails, it stops resetting the timer chip.  (Likewise,
if the satellite "hears" no commands for a long period of time, the
command timer is never set back to zero.)
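
(To make the basic mechanism concrete, here is a minimal Python
simulation of a single watchdog timer.  It is only a sketch: the class,
the "kick" terminology and the tick values are my own assumptions for
the example, and real hardware does this with a counter chip, not
software.)

```python
class WatchdogTimer:
    """Simulated hardware watchdog: counts ticks, alarms at the timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.count = 0
        self.alarm = False

    def kick(self):
        """What healthy software does: set the timer back to zero."""
        self.count = 0

    def tick(self):
        """One unit of time passes; the timer runs whether or not the CPU does."""
        self.count += 1
        if self.count >= self.timeout_ticks:
            self.alarm = True


def run(total_ticks, kick_every, hang_at=None, timeout_ticks=10):
    """Return the tick at which the alarm fires, or None if it never does.

    hang_at is the (hypothetical) tick at which the CPU locks up and
    stops kicking the watchdog.
    """
    wd = WatchdogTimer(timeout_ticks)
    for t in range(1, total_ticks + 1):
        cpu_alive = hang_at is None or t < hang_at
        if cpu_alive and t % kick_every == 0:
            wd.kick()
        wd.tick()
        if wd.alarm:
            return t
    return None
```

Calling run(100, 3) never produces an alarm, because the count is reset
long before it reaches 10.  Calling run(100, 3, hang_at=20) produces
one a few ticks after the simulated lock-up, once the count has run
unchecked up to the timeout.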

When the timer successfully reaches its preset "alarm time", it
generates a master (hardware) reset of the computer.  The computer
boots up, does some self-testing, "figures out" that it is being
restarted from a timeout event, and performs remedial processing.
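
(How the computer "figures out" why it was restarted varies by design.
One common approach is a flag or register that the reset hardware
latches and the boot code reads.  The Python sketch below assumes such
a flag exists; ResetCause, boot() and the step names are hypothetical
and are not taken from our repeater or from the satellite.)

```python
from enum import Enum


class ResetCause(Enum):
    """Hypothetical reset-cause flag latched by the reset hardware."""
    POWER_ON = "power-on"
    WATCHDOG = "watchdog"


def boot(reset_cause):
    """Return the ordered list of boot steps for a given reset cause."""
    steps = ["self-test"]
    if reset_cause is ResetCause.WATCHDOG:
        # Restarted by the timer alarm: do the clean-up work first.
        steps.append("remedial-processing")
    steps.append("normal-operation")
    return steps
```

A power-on boot goes straight from self-test to normal operation,
while a watchdog boot inserts the remedial step in between.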

In the case of our terrestrial repeater, this involves simple things
like clearing status bits, checking to see if the repeater shack
security system is turned on but the door to the shack is open (i.e.,
time to trigger the burglar alarm and play a voice message over the
air that says "Intruder Alert!"), making sure that the phone patch is
in an "on-hook" state, selecting the proper courtesy beep tone,
ensuring that COR is off, etc.

In the case of the satellite, the "remedial processing" is to begin
execution of the "command-assist" routine.  This would likely do
things like ensure that the EDAC (error detection and correction)
is enabled, set the IF matrix to receive on pre-selected band(s)
and to transmit telemetry on other pre-selected band(s), activate
the omni antennas (or perhaps cycle back and forth between the omni
and high-gain antennas, in case the spacecraft is too far distant
or aimed in a direction that prevents it from hearing well with the
omnis), etc.  It might even automatically re-boot the IHU-2
computer, and restart the "repeater" program that was playing the
telemetry over the VHF Middle Beacon.

In summary, HARDWARE is used to force a reset of the microprocessor
and support components, because this is NOT dependent on the processor
being operational.  The hardware reset (hopefully) gets the CPU going
again, and tells it to do whatever will maximize the chances of
regaining contact with the controllers on the Earth.

The "command assist" software is STARTED by the expiration of the
hardware timer.  Just as you imagined, we can't rely on a sick CPU
to diagnose itself as being sick!

It's just like when your Windoze machine locks up: you don't expect it
to reboot itself automatically.  But if you push the reset switch, or
turn the power off and on again, you (usually) do expect that it will
come to life again.  Just think of "command assist" as being analogous
to automatically running scandisk when your PC was not shut down
properly, but instead underwent a hardware reset or a power off/on
cycle.

> Also, what sort of commands was AO40 processing when the telemetry
> stopped?

I was curious about that myself.  So I went to the Amsat ftp telemetry
archive and grabbed the last telemetry block, unzipped it, and used
the playback function of P3T to take a look.  I didn't see anything
that enlightened me, but someone who has stared at telemetry longer
than I may be able to see something that answers the question.

Also, there's no assurance yet that it was a command that caused the
failure.  (I seem to recall that Oscar 13's IHU was only restarted a
couple of times in its many years of operation, and one of those was
an operator error [reversed data value and data location in a 'poke'
command], and one was an accidental reset command.)

Perhaps an exceptionally strong burst of radiation flipped memory
bits in both IHU-1 and IHU-2.  We know that IHU-2 has locked up several
times, shutting off the telemetry downlink, which is ONE of the current
symptoms.  The EDAC in IHU-1 can only correct single-bit errors in a
given memory word, and relies on the physical separation of the
microchip packages on the SIMMs (plus radiation shielding, of course)
to make it unlikely that two bits in the same word get hit by radiation
and altered at the same time.  But it could conceivably happen.
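
(To show why one flipped bit per word is recoverable and two are not,
here is a Python sketch of a textbook single-error-correcting Hamming
code on an 8-bit word with 4 check bits.  I am assuming this layout
only for illustration; I don't know that the IHU-1 EDAC uses exactly
this code or word size.)

```python
# Textbook Hamming(12,8) layout: check bits at positions 1, 2, 4, 8.
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]
PARITY_POSITIONS = [1, 2, 4, 8]


def encode(byte):
    """Return a 12-bit codeword as a dict mapping position -> bit."""
    word = {p: 0 for p in range(1, 13)}
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (byte >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p covers every other position whose index has bit p set.
        word[p] = sum(word[q] for q in range(1, 13) if q & p and q != p) % 2
    return word


def syndrome(word):
    """XOR of the positions holding a 1; zero for an undamaged codeword."""
    s = 0
    for q in range(1, 13):
        if word[q]:
            s ^= q
    return s


def decode(word):
    """Return (byte, syndrome), flipping the bit the syndrome points at."""
    word = dict(word)
    s = syndrome(word)
    if 1 <= s <= 12:
        word[s] ^= 1  # assumes exactly one bit was hit
    byte = 0
    for i, pos in enumerate(DATA_POSITIONS):
        byte |= word[pos] << i
    return byte, s
```

Flip any single bit of encode(0xA5) and decode() returns 0xA5 again,
with the syndrome naming the damaged position.  Flip two bits (say
positions 3 and 5) and the syndrome points at a third, innocent bit,
so the "correction" makes things worse and the decoded byte is wrong.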

And so could a thousand other things.

It's a fascinating mystery to be solved.  Too bad that the stakes are
so high.

Via the amsat-bb mailing list at AMSAT.ORG courtesy of AMSAT-NA.
To unsubscribe, send "unsubscribe amsat-bb" to Majordomo@amsat.org