So you have completed the designs, the branding, the wireframes, the use-cases and onwards, through the development, the iterations, the testing, the enhancements, the tweaks and the changes...
You've done the penetration test, the accessibility test, the usability test, the stress test and the test of the tests themselves.
By now you'll have invested a significant amount of money and definitely a substantial amount of your own precious time.
And now your website or online business system has 'gone live'. You are now the proud owner of a site/application that is available to the world. And, as is totally understandable in today's world of the internet, your site/application will be available to your customers all the time, 24/7/365. There will be no downtime - fact!
A week passes.
And then a number of unrelated events all occur in the space of a couple of days:
- You notice a part of the site/application that you want improved,
- Your business diversifies and you need to rework a whole section of the application to provide changed features,
- The company that provides the platform that you've built the site/application upon releases a new version of the platform,
- The operating system underlying your site needs a security patch installing,
- The company that you are hosting with needs to replace a piece of networking kit.
Contacting your various suppliers, you are told that resolving each of these items will involve "some" downtime. Before you know it, your site/application is offline for "some" amounts of time. Your customers are greeted by connection errors and 'site unavailable' messages, and are rapidly losing confidence in your company. How could this possibly be happening?
Well, if we're honest with ourselves, and take a closer look at what is actually involved in maintaining the 'uptime' of a site/application, we quickly see that for all but the simplest of brochure-ware websites, it's a very complicated business indeed.
The list of commodities and infrastructures that need to be in place to even serve a single page to the internet is a long one. From hosting space and cooling at the very physical end, through power and internet connectivity, and then on to the servers and networks themselves. Then there's the operating system and the platform, and we've still not got to the code of your site yet. It's a small miracle that your website can be presented to the internet at all.
With so many interactions and points of failure in each of the various subsystems, outages are commonplace in a lot of computing systems, and hosting a website/application on the internet is no different. In order to avoid your customers noticing failures of these subsystems, you will have had to design and implement a highly redundant "super system" that copes with the failure of any of its individual components. As you might expect, this isn't an easy undertaking and, by its nature, isn't particularly cheap.
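To get a feel for why a long chain of subsystems fails so readily, it helps to remember that if serving a page depends on every link in the chain being up, the availabilities multiply together. The sketch below uses purely illustrative figures (not measurements from any real system) to show how a handful of individually respectable components combine into a rather less impressive whole - and how a fully redundant second copy of the chain claws that back:

```python
# Illustrative only: availability of a chain of dependent subsystems.
# These figures are made up for the example; if a page needs every
# subsystem to be up, the individual availabilities multiply.
subsystems = {
    "power": 0.9999,
    "network": 0.999,
    "server hardware": 0.999,
    "operating system": 0.9995,
    "platform": 0.999,
    "application": 0.999,
}

combined = 1.0
for name, availability in subsystems.items():
    combined *= availability

print(f"Combined availability of the chain: {combined:.4%}")

# With two fully redundant, independent copies of the whole chain,
# the pair is only down when both copies are down at once.
redundant = 1 - (1 - combined) ** 2
print(f"With a redundant pair of chains:    {redundant:.4%}")
```

Six components, each up well over 99.9% of the time, still leave the chain as a whole noticeably worse than any single link - which is exactly why the redundant "super system" has to be designed in from the start.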
This week, at PDMS, we are going to celebrate 100% uptime for a large site that serves nearly 12 million pages a month. We report this value over a rolling 3-month period, so we haven't missed a beat for around 100 days now. It's not 99.9% or even 99.99%, it's 100% uptime. And I'm really proud of that fact.
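To put those "nines" in perspective, a quick back-of-the-envelope calculation shows how much downtime each figure actually permits over a 30-day month:

```python
# How much downtime each "nines" figure allows over a 30-day month.
minutes_per_month = 30 * 24 * 60  # 43,200 minutes

for uptime in (0.999, 0.9999, 1.0):
    allowed = minutes_per_month * (1 - uptime)
    print(f"{uptime:.2%} uptime allows {allowed:.1f} minutes of downtime per month")
```

Even 99.99% still leaves room for over four minutes of outage every month; 100% leaves room for none at all.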
Achieving this has taken a lot of engineering at all levels. We've had to implement a very resilient and redundant network infrastructure, comprising a large number of devices, from firewalls and switches to routers and load-balancers. At the server and operating system level, we've employed virtualisation technologies that react to changes in the underlying infrastructure. At the database level, we've adopted high availability strategies to ensure data consistency and availability. And, at the application level, the code has been specifically written to allow for a distributed network.
Having been asked by the boss's boss at the end of last week, I've spent some time over the past few days trying to think of the one thing that has made this possible. I think it can be summed up as follows: at PDMS, I'm fortunate enough to work with a very talented, diverse set of engineers, all under one roof.
Being able to provide all of the skills necessary to design, build, implement and support a system of this scale, all from within one company, lets us cut through the difficult communication issues that come with multiple suppliers. Everything needed to provide 100% uptime is managed by people that I can physically go over and talk to.
At some point, we're going to 'miss a beat', our unbroken run of uptime will be consigned to history, and we'll have to start the counters again and try to break the record that we're setting right now. For the superstitious, I guess writing this article has cursed the whole thing; we'll have to wait and see!