Testing the robustness of IT systems.

Robustness is an aspect of IT systems that always deserves special attention. Developers therefore write unit and integration tests for their systems. But this raises a question: how sure are we that the system actually works?

A testing effort that is still manageable in small systems becomes considerable in distributed systems. Unlike in a monolithic architecture, problems can occur in a distributed system even though every individual service is working correctly. Such failures arise, for example, from network problems such as excessive latency, or when parts of the system become unreachable because of hardware faults or a power failure.
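To make this concrete, here is a minimal sketch of a service that is perfectly correct, yet whose caller still sees a failure once the network gets slow. Everything in it (`Network`, `price_service`, `remote_call`, the timeout values) is a hypothetical illustration, and latency is only simulated by adding up numbers, so no real time passes.

```python
from dataclasses import dataclass


class CallTimedOut(Exception):
    """Raised by the caller when a reply does not arrive in time."""


@dataclass
class Network:
    """Hypothetical network model: a fixed one-way latency in milliseconds."""
    one_way_latency_ms: float


def price_service(item: str) -> int:
    """A correct service: always returns the right price (in cents)."""
    return {"coffee": 250, "tea": 200}[item]


def remote_call(network: Network, item: str, timeout_ms: float) -> int:
    """Simulate a remote call; latency is modelled, not actually waited for."""
    round_trip_ms = 2 * network.one_way_latency_ms
    result = price_service(item)  # the service itself never fails
    if round_trip_ms > timeout_ms:
        # The answer was computed correctly, but it arrives too late.
        raise CallTimedOut(f"no reply within {timeout_ms} ms")
    return result
```

With a fast network the caller gets its answer. With a slow one the same call raises `CallTimedOut`, although a unit test of `price_service` alone would still pass.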

The best way: Chaos Engineering.

Chaos engineering can be used here as a supplement to tests to gain additional assurance.

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” [https://principlesofchaos.org/]

Chaos engineering consists of various chaos experiments whose aim is to show that the system can be thrown off balance only with great difficulty, if at all. Unlike unit tests, chaos experiments should use system metrics as verifiable results, not internal attributes. Chaos engineering should show that a system works, not how.
Since these experiments should start from the real steady state of the system, it is recommended to run them on a production system. An experiment must not degrade the user experience too much, yet its impact must be strong enough to produce measurable results.
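The shape of such an experiment can be sketched in a few lines. This is a toy simulation under stated assumptions, not a real chaos tool: `Replica`, `Cluster`, `success_rate` and `run_experiment` are hypothetical names, the "system" is an in-memory cluster that retries a failed request once, and the metric is simply the share of successful requests as seen from the outside.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for one service instance."""
    name: str
    healthy: bool = True

    def handle(self) -> bool:
        return self.healthy


@dataclass
class Cluster:
    """Toy system under test: tries a request on up to two random replicas."""
    replicas: list

    def request(self, rng: random.Random) -> bool:
        order = self.replicas[:]
        rng.shuffle(order)
        # First attempt plus at most one retry on a different replica.
        return any(replica.handle() for replica in order[:2])


def success_rate(cluster: Cluster, rng: random.Random, n: int = 1000) -> float:
    """System metric observed from outside: share of successful requests."""
    return sum(cluster.request(rng) for _ in range(n)) / n


def run_experiment(cluster, inject_fault, rollback, threshold=0.99, seed=0):
    """Check the steady state, inject a fault, measure again, always roll back."""
    rng = random.Random(seed)
    baseline = success_rate(cluster, rng)
    if baseline < threshold:
        raise RuntimeError("system is not in a steady state; aborting")
    inject_fault(cluster)
    try:
        observed = success_rate(cluster, rng)
    finally:
        rollback(cluster)  # limit the impact on users
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed >= threshold,
    }
```

In this toy setup, taking down one of three replicas leaves the hypothesis intact because the single retry always reaches a healthy instance, while taking down two of three makes the success rate drop measurably. Note that the experiment only looks at the metric, never at the internals of `Cluster`.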

Chaos Engineering is not just a buzzword. Every software developer should at least know the basics of proactive error testing to create better, more robust and resilient systems.

“Introduce a little anarchy. Upset the established order, and everything becomes chaos. I’m an agent of chaos…” – The Joker

Have fun. Enjoy coding.
Your INNO coding team.