Work Header

[SKB-1539] Shield oscillation may drop below specified minimum frequency after several months of continuous operation

Work Text:

User: Capt. Eric James (@EKJ2068)
Date: 0050-08-32 10:48:32
New issue: [SKB-1539] Shield oscillation may drop below specified minimum frequency after several months of continuous operation

I noticed something today in the shield oscillation control code that concerns me. The short version, for people who will recognize the type of problem as soon as I point it out and/or who are too busy to read this whole block of text, is that the oscillator controller uses the built-in fractional number type, which is sometimes inexact, in a way where the safety shutdown could be incorrectly activated because of small discrepancies between the intended value and the closest value that the format can actually represent.

Here's the long version:

Obviously I haven't tested this, since the shield itself isn't online yet (although I suppose we could always simulate the offsets that would be introduced by natural fluctuations in the flow rate of the generator), but I did some quick calculations and I'm concerned that if we leave the system online for more than 120 days or so, we may run into a problem where the compounding errors from small inaccuracies in the way the ratios are represented lead to a cascade that causes the computed shield frequency to drop below the recommended safe minimum of 56.6 DecF. We should be fine if it drops to 56, as long as it doesn't happen regularly/predictably (although, please see the note I've included at the end of this message for an explanation of why I don't think we can completely dismiss this as a concern), but if it drops to 55 or below, I don't even want to speculate about what kinds of problems we might run into.

Take a look at shield/oscillator_control.fp, especially line 242. The formula for adjusting the shield refresh frequency so it can sync back up with the power coming in from the generator on the next microcycle uses the reciprocal of one of the inputs from the generator -- basically, we're dividing by a number that might be really, really small, so if that number is even slightly wrong, we're going to be looking at potentially a very big error in the result. And the problem is that that input comes from a part of the generator where it's computed by adding up a bunch of really small offsets, some of which are negative, and because the arithmetic is inexact, there might be a situation where, for example, it adds 0.001 to -0.001 and comes up with a 0.000000002 or something instead of an actual 0 result.

Ordinarily, I'd solve a problem like this by updating the code so that the short-circuit condition checks that the input is within some small neighborhood of 0 instead of doing an exact equality check, but I'm pretty sure that won't work here... there are situations where that number might be *really* small but it might still be a valid input. Every time the generator sync function sends us a value, it then assumes we've updated according to that value when it sends the next one, so if it sends a very large value immediately after a very small one, there might be a case where the difference between those two values is within our tolerances *but* the difference between *zero* and the second value is too big a jump for the hardware to handle, which could trigger the safety latch and knock the shields completely offline for several minutes. And we don't have a way to feed the info about an ignored nonzero input back to the caller, at least not with the way the interface is currently designed... so they can't compensate for it on their end either.

So we need to be able to handle arbitrarily small nonzero inputs as if they were legitimate, because they very well might be; we can't just throw them out. But I'm very concerned about the potential repercussions here.


(Note: I *do* think there are reasons to be worried about predictability here, given that the behavior could potentially be triggered by certain combinations of generator flow rate and shield oscillation frequency. There's no way for an external attacker to manipulate the flow rate, but there's no way to *stop* them from *observing* it -- since it's from, you know, the star we're orbiting -- and if they think to measure it over time and also measure the fluctuations in the shield frequency, they might be able to reverse-engineer the relationship between them. A possible fix for this would be to introduce some random latency into the resync process, but that raises a different set of potential problems.)

User: Lt. Alae Nereida (@APN9971)
Date: 0050-08-32 15:12:19
Comment #1 on ^SKB-1539

+1. I was wondering about this myself, and just hadn't gotten around to opening an issue yet. I know there are performance issues with the IntegerRatio type when we're doing a lot of calculations at once, but I doubt the loss in efficiency would be worse than the potential ramifications if this is left completely unhandled. At the very least, I suggest that we try implementing it the safer way and then profile both versions of the code to see if the performance hit is enough to be concerned about.

Honestly, I don't understand why the library code for working with integer ratios isn't more heavily optimized. Given that we work with a lot of fractional values in mission-critical systems... I'll spare you this rant, because everyone on the team has heard it before, but you know how I feel about arithmetic being potentially imprecise by default. It's not like FleetPlus is a general-purpose programming language for casual users, so why are we prioritizing ease of use over correctness? (Rhetorical -- I've read their justification, I just don't agree with it.) (And if you don't already know how I feel about this, you can go watch the talk that I gave about layer types at the Imperial Conference on New Directions in Interface Programming last year. There's a link to the holorecording on my staff profile.)

User: Capt. Marie Xan (@MXX1232)
Date: 0050-08-34 09:11:52
Comment #2 on ^SKB-1539

Performance issues shouldn't be a concern at all, since the pace-keeping function only polls the generator every eighty milliseconds or so. The concern for me is the overhead that this change would introduce, in terms of updating the rest of the codebase. Any change we make in the interface to the shield library is going to need to be propagated out into *every single place* that those functions are called, *anywhere*, *ever*, and at this point in the project that's probably already upwards of ten thousand lines of code.

User: Capt. Eric James (@EKJ2068)
Date: 0050-09-02 12:31:46
Comment #3 on ^SKB-1539

Oh, you're right about performance -- I didn't realize we had such a big drift allowance. Nice to know!

The interface wouldn't be a problem if we had used behavioral types, but I guess that shuttle has flown.

User: Lt. Alae Nereida (@APN9971)
Date: 0050-09-03 07:22:01
Comment #4 on ^SKB-1539

If it's just a question of changing the syntax at the callsites, I can automate that transformation.

User: Capt. Eric James (@EKJ2068)
Date: 0050-09-03 11:19:20
Comment #5 on ^SKB-1539

Please do not under any circumstances try to automate that. I haven't been able to get SabreAurek server #286 to communicate reliably with the command post on the north shore since you ran that hand-written transpiler on its configuration files.

User: Lt. Alae Nereida (@APN9971)
Date: 0050-09-03 12:03:57
Comment #6 on ^SKB-1539

We could do it with Lithe.

User: Capt. Eric James (@EKJ2068)
Date: 0050-09-03 15:12:31
Comment #7 on ^SKB-1539

@APN9971 Do not do it with Lithe. Lithe is nowhere near ready for deployment on mission-critical hardware.

User: Lt. Alae Nereida (@APN9971)
Date: 0050-09-03 15:43:20
Comment #8 on ^SKB-1539

Understood, sir. I was (mostly) joking.

In all seriousness, though, I don't mind working over Festival of Stars if that would enable us to do the conversion by hand and still meet our end-of-year deadline. It's not like I have travel plans.

User: Lt. Alae Nereida (@APN9971)
Date: 0050-09-27 10:41:16
Comment #9 on ^SKB-1539

@EKJ2068 It's been a few weeks -- any update on what (if anything) you want me to do with this? Like I said, I'm happy to work over my vacation if that's what's called for, but if other teams are counting on this interface *not* changing while they're away, then obviously I won't go in and start fiddling with stuff. Would appreciate hearing back by the end of the week if possible.

User: Capt. Eric James (@EKJ2068)
Date: 0050-09-28 16:32:05
Comment #10 on ^SKB-1539

Sorry for keeping you waiting! So, I actually discussed this at length in the cafeteria a few days ago with @MXX1232 and a couple of the specialist officer cadets who we have on loan from Modeling & Verification. I wanted to ping you to come upstairs and join us, but your schedule said you were in the middle of your sidearm recertification exam. Anyway, we figured out that it's unlikely to pose a problem in practice, but by the time I got back to my desk I forgot to actually pull up the issue tracker and log it as closed -- I'll do that right now. No need to worry about fussing with the interface. Enjoy your vacation!

User: Capt. Eric James (@EKJ2068)
Date: 0050-09-28 16:55:47
Updated issue: ^SKB-1539 status changed from ACTIVE - UNDER DISCUSSION to CLOSED - WONTFIX

Preliminary mathematical modeling indicates that the bug should only happen if the energy generator and the shield array are both in continuous operation for upwards of three standard months. Given that the regular maintenance schedule involves a rolling shutdown of the generator every two months for hypermatter residue reclamation, we believe that the shield frequency should remain well within the limits of the specification in practice, even in cases where abnormally cold weather necessitates that the two pieces of equipment resync more often than usual. While there may be minor fluctuations in the shield frequency due to the technical details of how very small ratios are represented internally, it is highly unlikely to ever pose a security risk to Starkiller Base.