Of all the DevOps techniques, automated testing requires the biggest investment in time, effort, staff training, and management energy. The payoff is equally large. Many of the business benefits that come from DevOps are due to automated testing. It is the single most important thing we in the larger IT community can do to improve the quality of our products and services.
Automated testing has been used in software development for a long time. With DevOps we now apply it to all the fields of IT: to new software products, changes to existing applications, network configuration changes, database changes, VoIP phone configuration changes, and so on. Every change to the IT infrastructure must be tested. Even physical changes must be tested!
Below I list the steps I believe necessary for a comprehensive test of an IT change. I use a change to a DHCP server as an example. The process and pattern apply to all areas of IT. These are meant to be executed for every change and in order.
[Development Environment: The testing below takes place in a development environment. The author of the change, in this case the system administrator, creates a low-fidelity, virtual representation of the production environment. It is isolated from other environments and safe to experiment in. Changes here do not affect production or other systems. It can be destroyed when the change is put into production.]
1. Syntax Test: Test the DHCP server configuration file. This tests that its syntax is correct and that its values are correct. An example of the latter is a check that IP address ranges are valid.
2. Function Test: Test that the application is functioning correctly. This checks things like file permissions, directories and files available, and so on. It tells us that the basic functions of the application are working correctly.
3. Configuration Test: Test every configuration option. Every setting you have in the configuration file must be tested to ensure it works as expected. If you have set the server to return a fixed IP address for a given MAC address then you must test it. Likewise you must test that it sends back the correct gateway IP address and so on.
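As a minimal sketch of the value check in test 1 and the fixed-address check in test 3, the Python below validates an address range against its subnet and compares an offered lease with what the configuration promises. The function names, the example addresses, and the `request_lease` callable are all hypothetical; in a real development environment `request_lease` would be whatever DHCP client or tool you use to query the test server.

```python
import ipaddress


def check_range(subnet, start, end):
    """Value check: return a list of problems with a DHCP range (empty if none)."""
    try:
        network = ipaddress.ip_network(subnet)
        first = ipaddress.ip_address(start)
        last = ipaddress.ip_address(end)
    except ValueError as exc:
        return [f"invalid address or subnet: {exc}"]
    if not (network.version == first.version == last.version):
        return ["subnet and range mix IPv4 and IPv6"]
    problems = []
    if first > last:
        problems.append("range start is after range end")
    for addr in (first, last):
        if addr not in network:
            problems.append(f"{addr} is outside {network}")
    return problems


def check_reservation(request_lease, mac, expected_ip, expected_gateway):
    """Configuration check: the lease offered to `mac` must match the config.

    `request_lease` is a hypothetical callable that asks the test DHCP server
    for a lease and returns a dict with "ip" and "gateway" keys.
    """
    lease = request_lease(mac)
    problems = []
    if lease.get("ip") != expected_ip:
        problems.append(f"{mac}: expected IP {expected_ip}, got {lease.get('ip')}")
    if lease.get("gateway") != expected_gateway:
        problems.append(
            f"{mac}: expected gateway {expected_gateway}, got {lease.get('gateway')}"
        )
    return problems
```

Both functions return a list of problems rather than raising, so a test runner can report every bad value in one pass instead of stopping at the first.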
[Integration Environment: The change can now be checked into version control and tested in the integration environment. This is a medium-fidelity, persistent environment. It is also fully isolated. The continuous integration (CI) server triggers when the DHCP configuration file is checked into version control, at which point it launches the next set of tests automatically.]
4. Convention Test: This will usually not apply to Unix configuration files but I list it here for completeness. It does apply to code, however, so it might apply to automated system configuration scripts. This checks that the code follows company coding conventions such as formatting, naming, documentation, and other stylistic conventions. Java developers, for example, use Checkstyle to perform this test.
5. Integration Test: These test the impact and behavior of the change with dependent systems. For example, does the change to the DHCP server work with the DNS, PXE, and TFTP servers? Does it work with the DHCP clients? Configuration file changes must follow the same process that software follows when using a CI server. I won’t go into detail here but readers can learn more from this excellent article.
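For system configuration scripts, the convention test in step 4 can start very small. The sketch below is an illustration only: the three rules (line length, trailing whitespace, tab characters) and the function name are made-up stand-ins for whatever your company conventions actually say.

```python
def check_conventions(text, max_line_length=100):
    """Return (line_number, message) pairs for each style violation found."""
    violations = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_line_length:
            violations.append((number, "line too long"))
        if line != line.rstrip():
            violations.append((number, "trailing whitespace"))
        if "\t" in line:
            violations.append((number, "tab character"))
    return violations
```

A CI job could run this over every script in the change and fail the build when the returned list is not empty.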
[Acceptance Environment: Tests 1-5 are very fast. The next set may take a lot of time and is not run immediately. It will be up to QA how these tests are batched and run. This will likely vary based on need and urgency. This environment is high-fidelity in that it resembles the production environment as closely as possible. It is persistent and models the entire set of applications and network. Like the others it is also fully isolated. It is run by the testers and QA department.]
6. End-to-end Regression Test: These are a more robust version of the integration tests. They should simulate the standard operation and behavior of the company network. They should include clients as well as servers: for example, Windows clients and (simulated) Apple iPads, if those are what you use. These tests may take hours to run.
7. Performance and Load Test: These tests check the performance and load impact of the changes. Ideally they test performance at both normal and peak loads. They should be run over an extended period of time to best test the impact of the change. The results are compared to an expected value rather than reported as pass or fail.
8. Network Test: This checks the network impact of the changes. It checks routes, firewalls, ports, network traffic, and so on. It tests that the change did not harm the network, and that all resources are still accessible or blocked on the network as intended.
9. Security Test: This checks that the change has not introduced a new vulnerability or, where that was its purpose, that it has enhanced security. It can check for exploits, permissions, and other factors that your security engineers believe appropriate. It should also include vulnerability scans and penetration testing. The security engineers write these tests, just as the network engineers write the network tests. Under DevOps everyone is a programmer.
10. Policy Test: This is an automated audit of your system. The changes are tested for compliance with company policy, law, and regulation. The key is that they are automated.
11. Usability Test (Manual, Optional): If the change is directly visible to the end user, for instance if the user interface or a report format has changed, then manual intervention is required. These changes must be verified by a person. This is the only manual step in the process.
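The performance and load test (step 7) differs from the others in that its result is a measurement compared to an expected value. A minimal sketch of that comparison, assuming you have collected response times in milliseconds and have a baseline from earlier runs, might look like this. The function name, the 10% tolerance, and the sample numbers are all illustrative assumptions.

```python
import statistics


def compare_to_baseline(samples_ms, baseline_ms, tolerance=0.10):
    """Summarize response-time samples relative to an expected baseline.

    Returns the median, the relative change against the baseline, and whether
    any slowdown stays within the tolerance. The threshold is illustrative.
    """
    if not samples_ms:
        raise ValueError("no samples collected")
    median = statistics.median(samples_ms)
    change = (median - baseline_ms) / baseline_ms
    return {
        "median_ms": median,
        "change": change,  # 0.2 means 20% slower than the baseline
        "within_tolerance": change <= tolerance,
    }


if __name__ == "__main__":
    # Hypothetical lease response times gathered during a load run.
    print(compare_to_baseline([10, 12, 14], baseline_ms=10.0))
```

Reporting the relative change, not just a flag, lets QA see a slow drift across several changes before it ever crosses the tolerance.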
[Production Environment: Once all the tests have passed and the right person or group has signed off on them, the change or changes can go into production. Testing is still not done, though.]
12. Deployment Test: Once the changes have been deployed into the production environment they must still be verified. They must be checked to see that they deployed successfully and that they are working correctly. These tests must validate the deployment but not harm the production environment.
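Because the tests above are meant to run for every change and in order, the overall flow can be sketched as an ordered list of stages grouped by environment. This is only a sketch under my own assumptions: the stage list is abbreviated, the checks are placeholders, and stopping at the first failure is a design choice I am assuming rather than something a particular CI server dictates.

```python
def run_pipeline(stages):
    """Run (environment, name, check) stages in order, stopping at the first failure.

    Each `check` is a callable returning True on success. Returns a list of
    (environment, name, passed) tuples for the stages that actually ran.
    """
    results = []
    for environment, name, check in stages:
        passed = bool(check())
        results.append((environment, name, passed))
        if not passed:
            break  # assumed policy: do not promote a change past a failed test
    return results


# Placeholder checks; real ones would invoke the tests described above.
STAGES = [
    ("development", "syntax", lambda: True),
    ("development", "function", lambda: True),
    ("development", "configuration", lambda: True),
    ("integration", "convention", lambda: True),
    ("integration", "integration", lambda: False),  # simulated failure
    ("acceptance", "end-to-end regression", lambda: True),
]

if __name__ == "__main__":
    for environment, name, passed in run_pipeline(STAGES):
        print(f"{environment:12} {name:24} {'ok' if passed else 'FAILED'}")
```

Keeping the stages as plain data makes it easy to add tests one at a time as they are written.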
This many tests will take time to build. They will be added incrementally and become a critical company resource.
The acceptance environment is a detailed model of the production environment. It could be challenging to build for a large corporate network, in which case it may have to be built as multiple environments. The acceptance environment will be useful beyond testing. It can serve as a “what-if” tool.
It could be used to test disaster recovery plans, plan and test large scale optimizations, plan physical migrations, and many others. A simulation of the production environment offers unique and novel opportunities for experimentation.
In my DevOps experiment I will build these environments and test the flow of changes through these environments. I would welcome feedback on this testing approach.