Lead Site Reliability Engineer
Primary Location: US, Texas, Plano
Date posted: 04/19/2021
Job Title: Lead Site Reliability Engineer
Role Overview: As the SRE Lead, you will lead a 24x7 SRE team of experienced individuals and be accountable for maintaining the appropriate service levels (availability, latency, and reliability) to serve our customers' needs, reducing the friction of managing change, planning capacity strategically, and continuously managing performance. Your responsibilities will include setting team priorities and goals and engaging with DevOps, Engineering, and other teams to understand and support our needs and projects. Every SRE manages the availability, scalability, security, performance, cost, and compliance requirements of our services. You will ensure that applications onboarded to SRE are instrumented for full-stack observability and continuous testing, introduce continuous improvement, and integrate with IT Service Operations. You will also create the strategy for AIOps through AI/ML and NoOps, delivering strategic innovation to improve availability, stability, and resiliency.
From device to cloud, McAfee provides market-leading cybersecurity solutions for both businesses and consumers. We help businesses orchestrate cyber environments that are truly integrated, where protection, detection, and correction of security threats happen simultaneously. For consumers, McAfee secures your devices against viruses, malware, and other threats, both at home and away. We want to continue to shape the future of cybersecurity by working together to build best-in-class products and solutions.
- Lead a 24x7 team of Site Reliability Engineers working on several key services and technologies to support our products in a resilient, scalable, compliant, and sustainable manner.
- Lead the initial response to and assessment of all operational incidents and requests.
- Oversee service operations. Develop outstanding operational processes and procedures for service delivery based on the ITIL framework and industry ITSM best practices.
- Create and manage day-to-day processes, including Change Management, Incident Management, and Problem Management.
- Work extensively to reduce Mean Time to Restore (MTTR) and Mean Time to Detect (MTTD).
- Develop well-rounded measurements to manage the operational performance of the services that deliver product and service support.
- Prepare, manage, monitor, and report on production service uptime and reliability, and drive the continual service improvement plan for recurring incidents.
- Work across Engineering and Support teams to ensure we meet our goals for service reliability, availability, and efficiency.
- Complete incident retrospectives. Manage the incident lifecycle and work directly with Engineering, DevOps, IT, and other teams on root cause analysis (RCA) and problem management for high-priority incidents.
- Ensure security events and alerts are addressed.
- Support product engineering teams on SRE-related activities to establish service level agreements for all pre-defined activities and provide a high-quality customer experience.
- Plan and deploy patches and product enhancements to our environments.
- Conduct readiness reviews before moving changes and deployments into higher environments.
- Participate early in the SDLC to ensure reliability is built in from the beginning, and create plans for successful implementations and launches and a smooth transition to the SRE team.
- Ensure agreement and coordination with Engineering, project and release/deployment teams.
- Continually evaluate and adopt the latest industry technologies to improve costs and processes.
- Provide leadership, strategy, vision, and direction in achieving a flexible, scalable, and innovative global service delivery model.
- Lead by example, both technically and organizationally, and establish credibility through the quality of the team's technical execution.
- Mentor, coach, and develop a globally distributed SRE team.
- Establish goals and measurements to determine success for your team.
- 10+ years of software development and/or technical operations experience running large-scale applications, including a minimum of 2 years of lead experience and a minimum of 3 years of technical architect or lead experience.
- You have experience in SRE / DevOps, Infrastructure Engineering, and Systems Engineering.
- You have experience defining and implementing highly resilient and reliable applications.
- You have experience building, maintaining, and operating production systems (> 99.9% SLA) on-premises or in the cloud (AWS).
- You will monitor, debug, and perform root cause analysis (RCA) for any service failure, and be involved in the complete development and deployment cycle.
- You have an understanding of development, debugging, administration, and automation frameworks: C#/.NET, PowerShell, Python, Ansible.
- You have experience with monitoring, logging, APM, and other tools: AppD, ELK, CloudWatch, New Relic, Moogsoft, SolarWinds.
- You have experience with CI/CD tools: Git, TeamCity, Jenkins, Artifactory, Ansible, Harness, AWS deploy, Octopus, etc.
- You have experience with container technologies: Kubernetes, Docker
- You have experience with both Windows and Linux Operating Systems
- You have knowledge of AWS cloud service offerings covering serverless and containerized workloads
- Nice to have: ITIL, HDI, AWS, or other cloud certifications.
- You are willing to work some non-standard hours to support a global team and programs.
Company Benefits and Perks:
We work hard to embrace diversity and inclusion and encourage everyone at McAfee to bring their authentic selves to work every day. We offer a variety of social programs, flexible work hours and family-friendly benefits to all of our employees.
- Pension and Retirement Plans
- Medical, Dental and Vision Coverage
- Paid Time Off
- Paid Parental Leave
- Support for Community Involvement
We're serious about our commitment to diversity, which is why McAfee prohibits discrimination based on race, color, religion, gender, national origin, age, disability, veteran status, marital status, pregnancy, gender expression or identity, sexual orientation, or any other legally protected status.