Lead Site Reliability Engineer
Primary Location: US, Texas, Plano
Date posted: 08/05/2021
Job Title: Lead Site Reliability Engineer
Role Overview: As the SRE Lead, you will lead a 24/7 SRE team of experienced individuals and be accountable for maintaining the appropriate service levels (availability, latency, and reliability) to serve our customers' needs, reducing the friction of managing change while being strategic about capacity and always managing performance. Your responsibilities will include setting team priorities and goals as well as engaging with DevOps, Engineering, and other teams to understand and support our needs and programs. Every SRE manages the availability, scalability, security, performance, cost, and compliance requirements of our services. You will ensure applications onboarded to SRE are instrumented for full-stack observability and continuous testing, introduce continuous improvement, integrate into IT Service Operations, and share support responsibilities for critical customer journeys, business flows, and applications. You will also forge the strategy for AIOps through AI/ML and NoOps, delivering strategic innovation to improve availability, stability, and resiliency.
McAfee is a leader in personal security for consumers. Focused on protecting people, not just devices, McAfee consumer solutions adapt to users’ needs in an always-online world, empowering them to live securely through integrated, intuitive solutions that protect their families and communities with the right security at the right moment.
About the Role:
- Lead a 24/7 team of Site Reliability Engineers working on several key services and technologies to support our products in a resilient, scalable, compliant, and sustainable manner. Provide the initial response to, and evaluation of, all operational incidents and requests.
- Oversee service operations. Develop outstanding operational procedures based on the ITIL framework and industry ITSM best practices for delivering services.
- Create and manage day-to-day processes, including Change Management, Incident Management, and Problem Management.
- Work extensively to reduce Mean Time to Restore (MTTR) and Mean Time to Detect (MTTD).
- Develop well-rounded Key Performance Indicators (KPIs) to manage the operational performance of the product and service support we deliver.
- Prepare, manage, monitor, and report on production service uptime and reliability, and work toward a continuous service improvement plan for recurring incidents.
- Work across Engineering and Support teams to ensure we meet our goals for service reliability, availability, and efficiency.
- Complete incident retrospectives. Manage the incident lifecycle and work directly with Engineering, DevOps, IT, and other teams on root cause analysis (RCA) and problem management for high-priority incidents.
- Ensure security events and alerts are addressed in a timely manner.
- Own the availability and performance of mission-critical services. Improve automation to prevent problem recurrence and to respond to all non-exceptional service conditions.
- Support product engineering teams on SRE-related activities to establish optimal SLAs for all pre-defined activities and provide a high-quality customer experience.
- Plan and deploy patches and product enhancements to our environments.
- Conduct readiness reviews before moving changes or deployments into higher environments.
- Participate early in the SDLC to ensure reliability is built in from the beginning, and create plans for successful implementations and launches and a smooth transition to the SRE team.
- Ensure agreement and coordination with Engineering, project and release/deployment teams.
- Develop productive relationships with business leaders across the organization to identify and remove barriers and ensure application operations and support meet expected levels of service, quality, and performance.
- Continually evaluate and adopt the latest industry technologies to optimize costs and improve processes.
- Provide leadership, strategy, vision, and direction in achieving a robust, flexible, scalable, innovative global service delivery model.
- Lead by example, both technically and organizationally, and establish credibility through the quality of the team's technical execution.
- Create a culture that supports innovation and creativity while delivering high output in a predictable and reliable way.
- Keep the team motivated to go beyond the expected in execution and expertise.
- Mentor, coach, and develop a globally distributed SRE team.
- Hire and onboard qualified candidates as needed to ensure that we have a sustainable SRE team.
About You:
- You have 10+ years of software development and/or technical operations experience, including experience running large-scale applications, with a minimum of 2 years of lead experience and a minimum of 3 years of technical architect or lead experience.
- You have experience in SRE / DevOps, Infrastructure Engineering, and Systems Engineering.
- You have experience defining and implementing highly reliable applications.
- You have experience building, maintaining, and operating production systems (> 99.9% SLA) on-premises or in the cloud (AWS).
- You will monitor, debug, and perform root cause analysis (RCA) for any service failure, and be involved in the complete development and deployment cycle.
- You have advanced knowledge and skills within a specific technical or professional discipline, with an understanding of the impact of your work on other areas of the organization.
Company Benefits and Perks:
We work hard to embrace diversity and inclusion and encourage everyone at McAfee to bring their authentic selves to work every day. We offer a variety of social programs, flexible work hours and family-friendly benefits to all of our employees.
- Pension and Retirement Plans
- Medical, Dental and Vision Coverage
- Paid Time Off
- Paid Parental Leave
- Support for Community Involvement
We're serious about our commitment to diversity, which is why McAfee prohibits discrimination based on race, color, religion, gender, national origin, age, disability, veteran status, marital status, pregnancy, gender expression or identity, sexual orientation, or any other legally protected status.