Application-level error blog - Page 6

MYSTERY #1

Will regulators be absolute in their focus on maximum divergence,
or will they allow exceptions?

RTS 25 specifies a “Maximum divergence from UTC” for different situations, and ESMA has gone to some length to clarify that these are hard limits. Not only did they remove a 99th percentile tolerance from an early draft of the RTS, but they also explicitly rejected an industry request to reinstate a percentile tolerance (including a request from us).

Nevertheless, you will have noticed that the STAC-TS.ALE tables above don’t just show the maximum of each sample distribution. They also show percentiles. Why do we bother?

The answer is that the question just won’t go away. The tables above show that there is often a huge difference between percentiles and the max (the 100th percentile). Engineering for the max versus engineering for two, four, or even six nines can have significantly different costs. For example, suppose our budget for application-timestamp error is 40 microseconds (what’s left after allowing, say, 10 microseconds for time-distribution infrastructure and 50 microseconds for potential faults that throw the host into holdover for up to 20 minutes before being remedied). If our target is to have the 99.99th percentile below 40 microseconds, the system in Figure 2 is compliant. But if we engineer for the max, we need to take the remediation steps that result in Figure 3.
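To make the budgeting concrete, here is a minimal Python sketch of the percentile-versus-max check. The error samples are invented for illustration (in practice they would come from measurement tooling such as STAC-TS), and the 40-microsecond budget is the assumed application-level share from the example above:

```python
import random

# Hypothetical application-timestamp error samples, in microseconds.
# Real figures would come from measurement tooling, not a simulation.
random.seed(1)
errors_us = sorted(abs(random.gauss(5, 8)) for _ in range(200_000))

BUDGET_US = 40.0  # assumed application-level share of the divergence budget

def percentile(ordered, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a pre-sorted list."""
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

p9999 = percentile(errors_us, 99.99)
worst = errors_us[-1]  # the max, i.e. the 100th percentile

print(f"99.99th pct: {p9999:5.1f} us  compliant={p9999 <= BUDGET_US}")
print(f"max:         {worst:5.1f} us  compliant={worst <= BUDGET_US}")
```

The point of the sketch is the gap between the two checks: a distribution can clear the budget at four nines while its single worst sample does not, which is exactly the difference in engineering cost discussed above.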

Furthermore, some regulators have told us that they understand it’s not economically feasible for a firm to guarantee absolute compliance with the regulation at all times. They liken the regulation to a speed limit. Speed limits are expressed in absolute terms, but police choose which speeders to pursue and which ones to let off the hook (such as men with wives in labor).

The critical question is what the regulators will consider an exception. Is an exception a “black swan” technical problem that occurs only once after years of perfect compliance? Or is it a completely foreseeable issue that occurs with low frequency every day (such as pauses in OS scheduling)?

For example, if I engineer for compliance at the 99th percentile, will the regulator consider the 1% of my timestamps that violate RTS 25 to be exceptions? If an application is timestamping 100 events per second on average, that means I will have something like 25,000 events each trading day whose timestamps violate RTS 25. Will regulators view these violations as wives-in-labor type situations? Some of the firms I speak with seem to think so. Others are more skeptical.
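The arithmetic behind that 25,000 figure is easy to check (the trading-day length is my assumption; a round seven hours is used here):

```python
# Rough count of daily RTS 25 violations when engineering to the 99th percentile.
# Figures from the text: 100 timestamped events per second, 1% out of tolerance.
events_per_second = 100
violation_rate = 0.01            # the 1% beyond the 99th percentile
trading_day_seconds = 7 * 3600   # assumed ~7-hour trading day

daily_violations = events_per_second * violation_rate * trading_day_seconds
print(daily_violations)  # 25200.0 -- "something like 25,000" violations per day
```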

In any case, until enforcement practice becomes clearer (perhaps through the first RTS 25 cases to be brought), the level of tolerance on which a firm bases its engineering is a choice the firm cannot avoid. The STAC-TS Working Group does not take a position on this question. Instead, it is offering the industry tools that provide the data to support a range of decisions.

Whatever tolerance you choose, don’t fall prey to another myth that is circulating….

<Next: MYTH #5 - If 99.x% of my timestamps fall within RTS 25, the odds that regulators will discover a timestamp that is out of compliance are very small.>

<Prev: MYTH #4 - There’s nothing to do about application-level error on a given server.>