Tuesday, November 4, 2014

The Art of Survivability


In the rough and tumble world of corporate software development, most companies establish "checkpoints" to ensure that their engineers don't get too far down the river before they realize that they needed a paddle (or an outboard motor). How they implement these inspections (with paperwork or process) varies greatly, but the overarching principles are standard. What a company needs from its software, operationally, in order to prevent a black-swan failure is:

1.       Scalability – developers need to demonstrate right from the start that each successive implementation will meet the *eventual* maximum usage envisioned (in terms of concurrent users and fully loaded database tables).

2.       Trackability – every release should document what changed or was added, in a public location, for everyone to see.

3.       Knowledge base – modern software is so immensely complicated and changes so rapidly that it is impossible to keep the technical documentation up to date. Therefore, to support maintainability, at least two full-time, permanent (in-house) software developers need to know everything about any particular system.

4.       Operationally Repeatable – Production controls must exist to allow for the restoration and re-running of a system from any arbitrary point.

5.       Archiving and Deletion – You need utilities to remove old data. Without them, performance degrades gradually, like sinking into quicksand, until it suddenly causes a cascade of timeouts.

6.       Implementation Rollback – any time a major software change is implemented, you need a way to roll it back.

Those are the biggies in general philosophical terms. Yeah, this list probably doesn’t help much down in the weeds when you're deciding which documents or gateways to put in place, but it’s a good start for the group discussions that will get you there.
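
To make item 5 a little more concrete, here is a rough sketch of what an archival utility might look like. It is only an illustration under assumptions of my own: the `events` and `events_archive` tables, the `created_at` column, and the `archive_old_rows` function are all hypothetical names, and SQLite stands in for whatever database you actually run. The idea is to move old rows out in small batches, one transaction per batch, so the cleanup job itself never holds a long lock and becomes the cause of the timeouts it was meant to prevent.

```python
import sqlite3

# Hypothetical schema: events(id, created_at, payload) and an
# events_archive table with the same columns.
OLD_IDS = "SELECT id FROM events WHERE created_at < ? ORDER BY id LIMIT ?"


def archive_old_rows(conn, cutoff, batch_size=500):
    """Move rows older than `cutoff` into events_archive, one batch at a time.

    Returns the total number of rows moved.
    """
    total = 0
    while True:
        # One short transaction per batch: commits on success,
        # rolls back if either statement raises.
        with conn:
            conn.execute(
                f"INSERT INTO events_archive "
                f"SELECT * FROM events WHERE id IN ({OLD_IDS})",
                (cutoff, batch_size),
            )
            moved = conn.execute(
                f"DELETE FROM events WHERE id IN ({OLD_IDS})",
                (cutoff, batch_size),
            ).rowcount
        if moved == 0:
            return total
        total += moved


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    for table in ("events", "events_archive"):
        conn.execute(
            f"CREATE TABLE {table} "
            "(id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
        )
    conn.executemany(
        "INSERT INTO events (created_at, payload) VALUES (?, ?)",
        [("2013-05-01", "old")] * 5 + [("2014-10-01", "recent")] * 2,
    )
    conn.commit()
    print(archive_old_rows(conn, cutoff="2014-01-01", batch_size=2))
```

Because the copy and the delete sit in the same transaction, a failure partway through a batch leaves both tables as they were, which also fits the spirit of items 4 and 6.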