Are you looking for a unique opportunity to create a Site Reliability function, with responsibility to plan, design, implement and lead a group charged with delivering growth and industry-changing strategic objectives?
Do you want to build next-generation capabilities that power merchant-first ecosystems? Become part of a curious, driven, and pragmatic team who believe in outcomes over outputs. Apply today!
* Growing our Site Reliability function
* Play a critical role in designing and developing the tooling, monitoring, control, self-service reporting, and analysis approach
* Establish policies and procedures that govern incident, change, and problem management protocols
* Monitoring and remediating systems, security, and network issues using various application and network management tools
* Interfacing with internal/external customers on operational issues by dispatching on-call engineers, facilitating communication, and driving resolution to events via standard operating procedures
* Tracking escalations and other key performance indicators
* Provide application administration in a 24x7 environment based on root cause analysis of logs, alerts, and various other diagnostic tools
* Work with peers to create a positive environment within the team
* Architecting and developing solutions and roadmaps for monitoring the various systems that constitute the operating environment
* Leveraging telemetry in an IT setting for alert response and troubleshooting
* Work across different teams to create innovative solutions that deliver high availability, scalability, and reliability.
* Provide technical leadership and carry out hands-on scripting, tooling, and automation for continuous operations.
* Detect incidents based on monitoring tools, notifications, and log files.
* Develop new monitors and modify existing ones as needed.
* Triage incidents and perform documented steps to resolve when a known error is identified.
* Log incidents in the Incident Tracking system, clearly documenting the symptoms others need to investigate the incident.
* Act as "incident owner," escalating to other support groups and following the status of the incident until it has been confirmed to be resolved.
* Work closely with technical support, security, engineers, customers, and other groups as needed to narrow investigative efforts and resolve incidents.
* Monitor running jobs for operational impact. Identify scheduled job failures.
* Maintain critical documentation assets, such as customer contact lists, escalation procedures, scheduled job inventories, and operational "run-books."
* Provide support via phone or pager on a scheduled basis as part of an on-call rotation
* 10+ years in a senior technical role (DevOps, Software Engineering, Systems Engineering, or Support Engineering).
* Demonstrated experience designing, installing and configuring monitoring solutions
* Solid understanding of monitoring fundamentals associated with SNMP, WMI, and synthetic transaction engines, plus experience with various commercial, open-source, and homegrown monitoring packages and methods (e.g., Splunk, Nagios, Zabbix, OneSite, Gomez, CA, HP OpenView).
* Strong scripting skills with languages such as PowerShell or Python.
* Understanding of Object Oriented languages such as C# or Java
* Solid understanding of application-level monitoring tools and techniques, including OpenTracing, OpenTelemetry, and APM tools (e.g., Elastic, Datadog, New Relic).
* Solid understanding of networking, including network devices, subnets, and routing protocols; ability to take and interpret packet captures (e.g., with Wireshark, formerly Ethereal).
* Solid understanding of systems, including server hardware, Windows and Linux operating systems, iSCSI/FC SAN/NAS/DAS storage, Hypervisor/Virtualisation (VMware, Hyper-V).
* Proficiency in AD/DNS/DHCP.
* Ability to independently build tools and to implement and test significant features and capabilities
* Outstanding written and verbal communication and interpersonal attributes.
* Strong technical aptitude and leadership.
* Clear and concise communication, both written and oral.
* Excellent analytical and troubleshooting skills
* Home Office allowance of £500
* £60 monthly contribution towards energy bills
* Unlimited Annual Leave
* Private Medical Insurance
* Plus many more
***For more information, please contact Declan via email: email@example.com. Please note that this opportunity does NOT provide sponsorship and UK residence is needed prior to application.***