Path to operational excellence

Management Focus

Management focus is essential to efficient and effective operations. Managers should dedicate attention, resources, and effort to improving integration operations: identify and set priorities, establish strategies and goals, and monitor progress so that the company’s objectives are met. Management should also drive continuous improvement to sustain operational excellence.

Improve Monitoring

To improve monitoring, check all servers and processes manually at fixed intervals, such as every 4 hours, and manually check infrastructure usage such as CPU and memory. Monitoring can be further improved with tools such as Splunk and Qlik, and alerting can be enhanced by sending automatic alerts or notifications when a process fails.
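As a rough illustration of automatic alerting, the sketch below turns a set of server snapshots into alert messages. It is a minimal example under stated assumptions: the `ServerSample` fields, the `build_alerts` function, and the threshold values are all hypothetical, and in practice the samples would come from a monitoring tool rather than being typed in by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerSample:
    """One snapshot of a server's state (hypothetical structure)."""
    host: str
    cpu_percent: float
    memory_percent: float
    failed_processes: List[str] = field(default_factory=list)


def build_alerts(samples, cpu_limit=85.0, memory_limit=90.0):
    """Return one alert message per breached threshold or failed process.

    The default limits are illustrative assumptions, not recommended values.
    """
    alerts = []
    for s in samples:
        if s.cpu_percent > cpu_limit:
            alerts.append(f"{s.host}: CPU at {s.cpu_percent:.0f}% exceeds {cpu_limit:.0f}%")
        if s.memory_percent > memory_limit:
            alerts.append(f"{s.host}: memory at {s.memory_percent:.0f}% exceeds {memory_limit:.0f}%")
        for name in s.failed_processes:
            alerts.append(f"{s.host}: process '{name}' has failed")
    return alerts


if __name__ == "__main__":
    samples = [
        ServerSample("app-01", cpu_percent=92.0, memory_percent=40.0),
        ServerSample("app-02", cpu_percent=30.0, memory_percent=50.0,
                     failed_processes=["order-sync"]),
    ]
    for alert in build_alerts(samples):
        print(alert)
```

Keeping the thresholds as parameters means the same check can be tuned per environment without changing the logic.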

Resourcing and Collaboration

Resourcing is crucial to efficient operations, so it is important to onboard people with the right skill sets. New team members should first contribute to the development team and should be trained frequently on security issues to prevent security incidents. They should also be encouraged to share knowledge across the team after issues are identified and resolved.

Coordination between the development team and the operations team is essential, and a DevOps methodology should be followed. The team lead should ensure that operation-friendly code is developed, and the development team should follow design principles and best practices. Knowledge transfers should be completed between the operations and development teams, and the operations team should provide continuous feedback to the development team.

Analysis and Reporting of Issues

Analyzing failures is crucial. Events generated by the different processes should be monitored, each failure should be analyzed to find its root cause, and corrective and preventive measures should be taken accordingly. Issues should be anticipated where possible, knowledge articles should be written about the issues encountered, and an active to-do list should be maintained. If external issues cause processes to stop, the processes should be restarted once those issues are resolved.
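One simple way to support this kind of analysis is to count how often each process fails for each root cause, so that recurring problems surface first. The sketch below assumes a hypothetical event format (dictionaries with `process`, `status`, and `root_cause` keys); real event logs will look different and would need to be mapped into this shape.

```python
from collections import Counter


def rank_failures(events):
    """Count failed events per (process, root_cause) pair, most frequent first.

    The event keys used here are assumptions made for illustration only.
    """
    counts = Counter(
        (e["process"], e["root_cause"])
        for e in events
        if e["status"] == "failed"
    )
    return counts.most_common()


if __name__ == "__main__":
    events = [
        {"process": "order-sync", "status": "failed", "root_cause": "database timeout"},
        {"process": "invoice-export", "status": "ok", "root_cause": None},
        {"process": "order-sync", "status": "failed", "root_cause": "database timeout"},
        {"process": "invoice-export", "status": "failed", "root_cause": "disk full"},
    ]
    for (process, cause), count in rank_failures(events):
        print(f"{process}: {cause} ({count} failures)")
```

The pairs at the top of the ranking are natural candidates for knowledge articles and preventive measures.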

Status reporting to management is essential because it gives management insight into operations. The operations team should present a dashboard of incidents, service requests, failures, and improvements made, and should highlight major issues such as failures along with their causes. The team should be transparent with management, show the corrective measures taken, and ask for management support when needed.

Compliance and Security

To ensure compliance and security, internal audits should be conducted at frequent intervals to check process compliance. These audits should cover all aspects of operational governance to confirm that the team is fulfilling its duties. User access to the application should be reviewed periodically, and access should be removed for users who have left the team. Patching in high-availability (HA) environments should be done in sequence so that services are not impacted, and server status should be checked manually after patching is completed. Penetration test and other security findings should be mitigated as a priority, and vulnerabilities should be managed through patch updates or other measures. Security certificates should be tracked proactively and renewed before they expire. Dedicated disaster recovery (DR) testing processes should be deployed in production beforehand; the actual processes running in production should not be used for DR testing, as doing so might interrupt current business operations.
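The proactive certificate check could be sketched along the following lines. This is an illustrative example only: the `certificates_due` function, the certificate names, and the 30-day warning window are assumptions, and the expiry dates would normally be read from a certificate inventory rather than hard-coded.

```python
from datetime import date


def certificates_due(certs, today, warn_days=30):
    """Return (name, days_left) for certificates expiring within warn_days.

    Results are sorted soonest first; a negative days_left means the
    certificate has already expired. The 30-day default is an assumption.
    """
    due = []
    for name, expires_on in certs.items():
        days_left = (expires_on - today).days
        if days_left <= warn_days:
            due.append((name, days_left))
    return sorted(due, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical inventory: certificate name -> expiry date.
    inventory = {
        "api-gateway": date(2024, 1, 20),
        "web-portal": date(2024, 6, 1),
        "legacy-sftp": date(2023, 12, 31),
    }
    for name, days_left in certificates_due(inventory, today=date(2024, 1, 1)):
        print(f"{name}: {days_left} days until expiry")
```

Passing `today` in explicitly, rather than reading the system clock inside the function, keeps the check easy to test and to run for a future date.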