At the Business of Software conference in Boston in October I hosted a workshop with the title "What do people do to keep their business online?" In this workshop I introduced the basic foundations of how we at Paessler run our business IT and what we do to make it failure tolerant. Those precautions are effectively what keeps our business online even as failures occur. And they will!
In this series of blog articles I will share with you six steps that will help make your business failure tolerant. — See also: The Complete Series
Step 5: Monitor Your Network
Once you've set up backup and recovery plans, make sure you know the status of all your systems: your IT infrastructure, including all crucial hardware components, your network connections, and your servers.
How Do You Keep Track of All This?
Well, we are The Network Monitoring Company! For all the IT systems I have mentioned above we have set up extensive monitoring. We monitor availability, speed, bandwidth usage, temperatures, CPU and disk usage, etc. for any piece of equipment that exposes this information. Using the auto-discovery of PRTG, this is actually implemented pretty quickly.
A little more complex is the monitoring of business processes. We have defined a number of step-by-step processes that we monitor (e.g. using the HTTP transaction sensor for the shop and checkout):
- Come to website, read, ask sales/support question
- Come to website, read, download trial, get trial key, activate trial key
- Come to website, put products in cart, pay, get software delivered, activate
- Download log files from web servers, crunch them, review data
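To illustrate the idea of a step-by-step check, here is a minimal Python sketch. It is not how PRTG's HTTP transaction sensor is implemented; the `Step` class, the `run_transaction` function, and the idea of matching on an expected text fragment are all assumptions made for this example. It walks through a list of pages in order and reports the first step that fails, which tells you where in the process a visitor would get stuck.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
import urllib.request


@dataclass
class Step:
    """One step of a business process (hypothetical structure)."""
    name: str          # label shown in the alarm, e.g. "checkout"
    url: str           # page to request for this step
    must_contain: str  # text that proves the page rendered correctly


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_transaction(
    steps: List[Step],
    fetch: Callable[[str], Tuple[int, str]] = http_fetch,
) -> Optional[str]:
    """Run the steps in order.

    Returns the name of the first failing step, or None if all pass.
    """
    for step in steps:
        try:
            status, body = fetch(step.url)
        except Exception:
            # Connection errors, timeouts, and HTTP error codes land here.
            return step.name
        if status != 200 or step.must_contain not in body:
            return step.name
    return None
```

The `fetch` parameter exists so the logic can be tried out without a live shop. A real checkout flow would also need to carry cookies or a session from one step to the next, which this sketch leaves out.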
We are monitoring all our backup processes. For example, by monitoring the date of the newest file in the target folder of a server's backup on the NAS, we can raise an alarm if the newest file is older than 7 days. This would indicate that the backup has not run for 7 days.
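The backup check described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the sensor we actually use; the function names and the choice to treat an empty folder as stale are assumptions made for this example.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_DAYS = 7  # alarm threshold, as in the example above
SECONDS_PER_DAY = 86400


def newest_file_age_days(folder: str, now: Optional[float] = None) -> Optional[float]:
    """Age in days of the most recently modified file below `folder`.

    Returns None if the folder contains no files at all.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(folder).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / SECONDS_PER_DAY


def backup_is_stale(
    folder: str,
    max_age_days: float = MAX_AGE_DAYS,
    now: Optional[float] = None,
) -> bool:
    """True if the newest file is older than the threshold.

    Assumption: an empty target folder also counts as a failed backup.
    """
    age = newest_file_age_days(folder, now)
    return age is None or age > max_age_days
```

Calling `backup_is_stale` with the path of the backup target folder on a schedule, and alerting when it returns `True`, would give roughly the behavior described above.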
Preferably one should perform the monitoring from several locations/networks and with different tools to make sure a failure is not hidden in a blind spot of the software.
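One way to use results from several locations is to compare them before raising an alarm. The following Python sketch is a hypothetical illustration (the `classify` function and its status labels are made up for this example): if a check fails from every location the service is probably down, while a failure from only some locations points to a network problem between that probe and the service, or to a blind spot of one probe.

```python
from typing import Dict


def classify(results: Dict[str, bool]) -> str:
    """Combine per-location check results into one status.

    `results` maps a probe location name to whether its check succeeded.
    """
    if not results:
        return "unknown"  # no probe reported anything
    failures = [loc for loc, ok in results.items() if not ok]
    if not failures:
        return "up"
    if len(failures) == len(results):
        return "down"     # every location agrees the service is unreachable
    return "partial"      # only some locations see a failure
```

For example, `classify({"office": True, "cloud-eu": False})` returns `"partial"`, which is worth investigating even though the office probe alone would have reported everything as fine.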
Altogether we have about 3000 sensors for the office network plus 2000 sensors on several off-site PRTG installations that run on Amazon EC2 and 15 other clouds around the globe.
In the next post I will give an overview of the costs that your failure-tolerant online business may generate.
At Paessler we have been selling software online for 15 years and we have had hardware, software, and network failures just like everybody else. We tried to learn from each one of them and to change our setup so that the same failure would never happen again.