Disney Streaming is looking for a Sr. Software Engineer to execute the technical vision and solutions for Engineering Reliability. This group is responsible for improving reliability, performance, and user experience across Disney+, ESPN+, and Hulu. An additional focus is de-risking launches, including big titles, feature rollouts, and global expansion.
This is an opportunity to join a team and optimize the user’s experience by analyzing Disney Streaming’s entire ecosystem, identifying areas for improvement, and building custom tooling that proactively improves issue identification and triage. Your analysis and root cause findings will help drive engineering initiatives and priorities.
To be successful in this role, you need to be a strong engineer who always strives to solve complex problems at scale, preferably with a strong background in distributed services as well as client architecture. Whether you’re in New York, California, Seattle, or remote, we provide opportunities to elevate your career and to transform an industry.
- Partnering with cross-functional teams on identifying, defining, supporting, and improving reliability and performance.
- Building, maintaining, and improving custom tooling that provides automated analysis of the ecosystem’s health and proactively identifies issues before they become incidents.
- Performing forensic analysis to identify root causes and recommending the cross-functional changes needed going forward.
- Communicating, discussing, and championing reliability efforts.
- Gathering and analyzing a variety of data points (qualitative and quantitative) and distilling that information into key insights for engineering.
- 5+ years of software development experience
- 3+ years of project or team lead experience
- Advanced knowledge of systems applications and hardware, server architecture, operating platforms, cloud technologies, and internet and web applications
- Experience with reliability efforts and initiatives in a technical organization
- Experience with modern SRE practices
- Experience working cross-functionally with teams of different expertise
- Service reliability or operational experience running large-scale, high-performance systems and internet services
- Comfortable analyzing logs (HTTP Archive files and various client/service logs)
- Ability to query and visualize data using Grafana, Kibana, Elasticsearch, CloudWatch, or similar
- Defines the processes used in identification of the root causes of operational issues and leads root cause analysis; resolves problems and/or recommends solutions for implementation by others.
- Oversees the development and implementation of tools, automation, and scripts to facilitate multiple platform maintenance, operational efficiency, reliability, and administration.
- Subject matter expertise in one or more of the following areas: iOS, Android, browser, or streaming technologies.
- Experience with AWS products and services (CloudWatch, Athena, DynamoDB, ECS, ElastiCache, Elasticsearch, Kinesis, Lambda, S3, SNS, SQS, etc.) or those of other cloud providers.
- BS in Computer Science, Electrical Engineering or Computer Engineering (or equivalent professional experience)
- Datadog experience
- Big data experience, including extracting data with tools such as SQL, Snowflake, Databricks, or Athena
This role is considered remote, which means the employee will work remotely on an ongoing basis and will not have an assigned workspace at a Company-designated location.