Is Software Antifragile?

2022-09-12

Nassim Nicholas Taleb’s books are influential in many areas and across many disciplines. They can change how we view and interpret the world. In The Black Swan Theory and Why Agile Principles Fit Better, we went on an adventure trying to explain why creating software with agile principles fits better than other approaches in a world described by Taleb.

In this post we will visit Antifragile, and see if we can find some more lessons for software engineering.

Before moving forward, I would like to thank the awesome Bilgin Ibryam for his great post: From Fragile to Antifragile Software

I agree with all points except the border between resiliency and antifragility. We will come to that.

Let’s first visit Antifragile.

Antifragile

The concept of antifragility was created by Taleb to define the opposite of fragility. While the book covers many important topics, including economics, randomness, volatility, options, history, detecting fragility and more, I will focus only on the state of being antifragile.

But before diving into that, let’s first visit the other states. If a complex system isn’t antifragile, then what can it be?

Alternatives to being antifragile

Complex systems - not necessarily engineering works; they could be anything: socioeconomic systems, education systems, markets, industries, a person (Fat Tony?) - can be categorized into four states:

1. Fragile

Fragile systems are harmed when exposed to stressors, and therefore we try to predict and avoid such events. They don’t like randomness or volatility. They need consistency and endless stability with no surprises. This drives us to predict, but how successful we can be at predicting rare, consequential events is the topic of The Black Swan.

2. Robust

Robust systems resist shocks and don’t crumble, but they stay the same. If the system is artificially created by us, it means we made certain assumptions about the world and added measures to resist shocks up to a certain point.

3. Resilient

Resilient systems are designed to adapt themselves to volatility. One could say that, when configured properly, public cloud systems and Kubernetes are designed for high resiliency. But are they antifragile?

4. Antifragile

An antifragile system gains and becomes stronger when exposed to shocks, stressors, randomness and volatility. It actually likes those events; it is the complete opposite of fragile.

Another important aspect of antifragility: complex systems that thrive on stressors, volatility and randomness tend to weaken and die if they are deprived of them.

Software... Fragile or Antifragile?

Now let’s think about software. Is it fragile or antifragile? Does it love stressors? Does randomness make it better?

To give an answer, we first need to see what a stressor is for software.

What does stressing software even mean?

Stressors of software

Getting executed is the first stressor of software. When not executed, software is merely a series of bits with no impact or interaction with anything else.

Humans and other systems that interact with our software are another group of stressors. Other interacting systems include not just legitimate systems, but also intruders, misconfigured or buggy network devices, operating systems, and the runtime environment that optimizes and executes our code; ultimately, everything that might have an impact on our processing.

So, if we are going to say our software is living, serving a purpose, we must have other entities that interact with it.

This implies that software is constantly exposed to stressors while it lives. It’s a natural part of its life. It’s a basic fact.

But does it get better when it’s exposed to those stressors? What does getting better mean for software?

Here, we shouldn’t confuse resiliency with antifragility. As Taleb also stated, the resilient endures shocks and adapts, but it stays the same; it doesn’t break. It survives for sure. But surviving is one thing, getting better is another.

Our magically scaling workload might be resilient. We design it to survive certain conditions and it does that, only that. But after enduring a traffic spike, it doesn’t get better.

What if our software provides output based on machine learning? What if feeding it with more data helps it return better results? Does that count as getting better?

No, because the code is still the same. Every single capability for resiliency is there because, one way or another, we have foreseen and added it. (Imagine a cluster with scaling disabled versus one where traffic is monitored and scaling rules are applied.)
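To make the scaling case concrete, here is a minimal sketch of such a rule. Everything in it is an assumption for illustration: desired_replicas is a hypothetical function, and the target of 60 percent and the limits of 1 and 10 replicas are made-up numbers that do not come from any real platform. It loosely follows the proportional style of rule that horizontal autoscalers apply.

```python
def desired_replicas(current: int, cpu_percent: int, target_percent: int = 60,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Hypothetical scaling rule: size the cluster so CPU returns to the target.

    Every number here was foreseen and chosen by a human in advance.
    """
    if current < 1 or target_percent < 1:
        raise ValueError("current and target_percent must be positive")
    # Integer ceiling of current * cpu_percent / target_percent, no floats.
    wanted = -(-(current * cpu_percent) // target_percent)
    # Clamp to the limits that were configured up front.
    return max(min_replicas, min(max_replicas, wanted))


before_spike = desired_replicas(current=2, cpu_percent=90)   # asks for 3 replicas
during_spike = desired_replicas(current=8, cpu_percent=200)  # capped at 10 replicas
after_spike = desired_replicas(current=2, cpu_percent=90)    # still asks for 3 replicas
```

The answer after the spike is identical to the answer before it. The workload survived, but nothing in the rule learned from the event; any improvement has to come from us.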

Such self-improving (instead of self-healing) capabilities may become normal in the future. Today, we refactor the code to make it serve its purpose better.

Coming back to the question, is software antifragile? Does it get better when it is exposed to shocks? Do our workloads love randomness?

No... and yes

While software doesn’t get better by itself when exposed to stressors, we can make it better by meticulously observing and changing it.

As engineers, we can observe how our software interacts with its stressors, take lessons from them and make it better by constantly modifying it. We keep the software antifragile by constantly inspecting and refactoring it.

Why did I say “we keep” instead of “we make”? Because making implies an activity with an end. If software is alive (executed, exposed to stressors) but not maintained, it slides back first to resilient, then to robust, and finally to fragile, for the simple reason that the conditions and stressors will never get easier; they will only become more destructive.

But do we have to release our software directly into the wild to get such empirical data?

Certainly not. While empirical data from real life returns priceless information (to those who know how to look and interpret), various testing schemes provide us with similar information without the risk of business impact.

So we can say, software is potentially antifragile. It is up to us.

Fragile software, fragile organizations

Many of us have witnessed solutions that are treated like a rare Chinese vase in organizations: a seemingly functioning solution with nicely written “test reports” on Confluence and tons of assumptions about how everything’s great.

For such products, disaster isn’t a question of “if” but “when”.

Many organizations that actually walk the path of proper testing do only enough of it to prove that what they create actually works. This is the bare minimum. It doesn’t create an opportunity to gain antifragility.

One may ask: if a test discovers a feature not actually working, can we not count it towards antifragility?

No. The reason is below.

Destructive testing as a means to gain antifragility

Testing is always an essential, inseparable part of software engineering when the product needs to meet certain quality levels. We don’t argue whether we should do it or not. Since software engineering is an empirical discipline, testing our solution is the only way to inspect and learn from what we have created without creating a business impact. Our next steps are heavily impacted by what we learn during testing.

Therefore, if other users and interacting systems are stressors of our software, testing is a great way to understand what we create and to improve it.

This leaves us a clear but challenging path to keep our software in an antifragile state.

We need to constantly try to break our creations. We need to actively look for limitations, deficiencies, rainy day scenarios and edge cases before they happen in real life.
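One way to go looking for such limitations is to inject faults on purpose. The sketch below is only an illustration under stated assumptions: FlakyDependency, call_with_retry and find_breaking_point are hypothetical names, and the fixed retry loop stands in for whatever protective measure a real solution has. It keeps increasing the number of injected failures until the code under test gives up.

```python
class FlakyDependency:
    """Test double that fails a configurable number of times before answering."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success

    def fetch(self) -> str:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retry(dependency: FlakyDependency, max_attempts: int = 3) -> str:
    """Hypothetical code under test: retry a fixed number of times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return dependency.fetch()
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError("dependency unavailable") from last_error


def find_breaking_point(max_faults: int = 10) -> int:
    """Inject more and more faults; return the first count that breaks the call."""
    for faults in range(max_faults + 1):
        try:
            call_with_retry(FlakyDependency(faults))
        except RuntimeError:
            return faults
    return -1  # nothing broke within the range we tried


breaking_point = find_breaking_point()
```

With three attempts, the call survives two consecutive failures and breaks on the third, so breaking_point is 3. The number itself matters less than the fact that we now know where the limit is before real life shows it to us.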

Why so harsh? Why can’t we just live with tests that focus on functionality?

Because plain tests do not uncover shocks or the impact of randomness. Happy path testing, no matter how thorough, only shows that the solution works as intended. Moving toward antifragility requires breaking it beyond what is asked or expected. There lies the uncharted area, the unknowns, the territory without a map, for our solution.
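The difference can be sketched with a toy example. The function split_evenly and both tests below are hypothetical and exist only for illustration; the destructive test combines a few hand-picked rainy day inputs with seeded random inputs and checks a single property, namely that the shares add up to the total.

```python
import random


def split_evenly(total_cents: int, people: int) -> list:
    """Hypothetical function under test: split a bill into equal shares."""
    share = total_cents // people
    return [share] * people


def happy_path_test() -> bool:
    # Confirms only that the intended scenario works.
    return split_evenly(900, 3) == [300, 300, 300]


def destructive_test(runs: int = 200, seed: int = 7) -> list:
    """Throw edge cases and random inputs at the function, collect failures."""
    rng = random.Random(seed)  # seeded so that a failing run can be repeated
    cases = [(0, 1), (1, 0), (100, 3)]  # hand-picked rainy day scenarios
    cases += [(rng.randint(0, 10_000), rng.randint(0, 12)) for _ in range(runs)]
    failures = []
    for total, people in cases:
        try:
            shares = split_evenly(total, people)
        except Exception as exc:
            failures.append((total, people, type(exc).__name__))
            continue
        # Property: no money may appear or disappear.
        if sum(shares) != total:
            failures.append((total, people, "lost cents"))
    return failures


happy = happy_path_test()
failures = destructive_test()
```

Here happy is True, yet failures is not empty: the same function that passes its functional test crashes when there are zero people and silently drops cents when the total does not divide evenly.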

Surely this doesn’t mean that all our limited resources must be channeled into this effort. It needs to be treated as part of the life journey of our solution and a pillar of its survival.

Issues found. Now what?

Issues will be found for sure. If not, it means there is something naive in our testing approach and we aren’t pushing hard enough or in the right way.

What is going to happen after we find a problem?

We record it as a bug, treat it as if it had caused business impact, and run it through our triage. We make sure that even if the decision is not to fix it, that decision is explicit.
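As a rough sketch of what “explicit” could look like in practice, the record below refuses a decision that comes without a reason and makes undecided findings easy to list. The names Finding, Decision and awaiting_decision, as well as the sample titles, are hypothetical; a real team would map this onto its own issue tracker.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    FIX = "fix"
    WONT_FIX = "wont_fix"


@dataclass
class Finding:
    """A problem discovered by destructive testing (hypothetical record shape)."""
    title: str
    decision: Optional[Decision] = None
    rationale: str = ""

    def decide(self, decision: Decision, rationale: str) -> None:
        # A decision without a stated reason is not an explicit decision.
        if not rationale.strip():
            raise ValueError("a rationale is required")
        self.decision = decision
        self.rationale = rationale


def awaiting_decision(findings: List[Finding]) -> List[Finding]:
    """Return findings that nobody has explicitly decided on yet."""
    return [f for f in findings if f.decision is None]


backlog = [Finding("Service gives up after three failed calls"),
           Finding("Upload fails on an empty file")]
backlog[0].decide(Decision.WONT_FIX, "Accepted risk, reviewed in triage")
pending = awaiting_decision(backlog)  # only the second finding remains
```

Choosing not to fix is still allowed here; what the sketch rules out is a finding that quietly disappears without anyone having decided anything.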

Conclusion

Just like the black swan theory can connect the dots between the real world and lean/agile principles, the concept of antifragility helps us strengthen the idea of why destructive testing and observability are vitally important.

We may create software that is robust and resilient. It can endure the storms that we have foreseen and for which we have already implemented countermeasures. However, gaining antifragility, that is, becoming stronger when faced with stressors without causing business impact, is possible only if we continuously invest in test approaches that are designed to break our solution.