ABOUT THE ROLE
· Develop engagement models for application teams, including SLI, SLO, and SLA metrics
· Lead the development of automation architecture for some of the most critical financial systems in the world
· Streamline and enhance the day-to-day operational workflows of shared services in a 24x7x365 environment hosted on AWS
· Lead the build of tools to enhance the performance, scalability, and observability of resources shared between multiple projects in production
· Utilize a wide variety of cloud-native technologies to create fault-tolerant, scalable, and secure high-performance services and pipelines on a global scale
· Work with other teams across the organization to define KPIs and evangelize best practices for performance and reliability
· Continuously improve observability to ensure the uptime and reliability of our applications and infrastructure
· Troubleshoot issues across the entire stack (hardware, software, application, and network) in both physical datacenter and cloud-based environments
· Lead root cause analysis efforts
· Provide on-call support for shared services and infrastructure
· Mentor and manage a four-person team, providing technical guidance and expertise
· Provide project management, oversight, and reporting for your team
REQUIREMENTS
· Proven track record of designing, building, optimizing, and maintaining applications and infrastructure at large scale in highly regulated environments
· Software development experience using Node.js and/or Python
· A deep understanding of the Linux operating system, from the console to the kernel
· Ability to work as part of a distributed team
· Knowledge of CI/CD best practices
· Experience with containers and container orchestration tools (e.g., Docker, Kubernetes)
· Deep experience working in the AWS environment