Senior Site Reliability Engineer, Cortex Machine Learning Infrastructure

  • Full-time

Company Description

Twitter is what’s happening and what people are talking about right now. For us, life's not about a job, it's about purpose. We feel real change starts with conversation. Here, your voice matters. Come as you are and together we'll do what's right (not what's easy) to serve the public conversation.

Job Description

Who We Are

The Cortex organization provides managed machine learning platforms, tools, processes, and workflows to developers at Twitter. We win when our customers win by helping our users stay informed and share and discuss what matters in service to the public conversation. Twitter is increasingly becoming an AI-first company, and Cortex is at the nexus of that evolution.

Our Cortex SRE team uses state-of-the-art open-source and proprietary technologies. We operate at a scale that few other companies do. We embed deeply with development teams, focusing on up-leveling services and increasing automation. We operate both on-premises and in multiple clouds, with both online serving and offline modeling services. Joining our team is an opportunity for an SRE to grow into the machine learning world over time and to work on a broad range of tasks, including contributing directly to the applications.

We care deeply about:

  • Enabling Ethical AI.
  • Engineering excellence such as good design abstractions, API stability, scaling, setting best practices for other engineers to follow, and solid documentation.
  • Staying abreast of, and compatible with, a quickly shifting technology landscape of Machine Learning platform components and related open-source solutions.
  • Creating the best Machine Learning Platform environment for Twitter that provides an exceptional developer experience for our engineering customers, and provides value to Twitter’s users. We offer Machine Learning as a managed service to the rest of Twitter Engineering.
  • Encouraging creativity and innovative solutions.

Our current projects include:

  • Creating a high-scale, Kubernetes-based, Machine Learning Model serving solution in a hybrid cloud environment.
  • Establishing Kubeflow on GCP as a managed offering at Twitter
  • Enabling model training in the GCP environment
  • Serving models to partner dev teams using AWS services
  • Establishing tooling and other production infrastructure that spans AWS, GCP, and on-prem environments.
  • Enabling and sustaining GCP infrastructure/platform components for broader use in the Cortex platform, e.g. AI Platform, Dataflow, Dataproc, etc.
  • Improving operations of ML Platform services: hosted notebooks, a centralized ML Metastore, and a centralized ML Dashboard

How you'll work:

  • Our team focuses on offering ML as a ‘managed service’ to our Twitter engineering customers. This requires an understanding of, and focus on, large-scale online serving systems of all kinds (as opposed to offline systems like Hadoop).
  • You will embed deeply with your Software Engineering (SWE) counterparts and take an active role as a co-owner of production services to ensure services are built, maintained, and operated in a reliable and scalable way.
  • You will be part of the successful delivery of new features and services, as well as the day-to-day successful operation of existing services.
  • Collaborate with your SWE partners to drive operational health improvements, root cause analysis, postmortem discussions, and their associated remediations that serve to improve reliability and sub-linearly scale operations.
  • Partner with both SWE and SRE teams to apply techniques that reduce business risk.
  • Perform infrastructure & configuration management, deploys, capacity modeling & planning, and incident mitigation.
  • Identify common patterns in the challenges of operating services in production, and partner with others to design and implement reusable solutions and other multi-functional work that drives down the complexity, difficulty, cost, and risk of operating the business.
  • You’ll be a member of a service on-call team, in the same on-call group as your SWE partners.
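Capacity modeling of the kind described above often starts from simple throughput math. The sketch below is a hypothetical illustration (the function, parameter names, and numbers are ours, not Twitter's): it estimates how many serving replicas a model endpoint needs from peak QPS, per-request latency, and per-replica concurrency, via Little's law, with headroom for spikes and host loss.

```python
import math

def replicas_needed(peak_qps: float, latency_s: float,
                    concurrency_per_replica: int, headroom: float = 0.3) -> int:
    """Estimate serving replicas via Little's law (L = lambda * W).

    peak_qps: expected peak request rate
    latency_s: mean per-request latency in seconds
    concurrency_per_replica: requests one replica handles in parallel
    headroom: extra capacity fraction for spikes and failures
    """
    in_flight = peak_qps * latency_s           # concurrent requests in the system
    raw = in_flight / concurrency_per_replica  # replicas at 100% utilization
    return math.ceil(raw * (1 + headroom))     # pad for spikes / host loss

# 5000 QPS at 40 ms latency, 8 concurrent requests per replica, 30% headroom:
# 200 requests in flight -> 25 replicas -> 32.5 with headroom -> 33
print(replicas_needed(5000, 0.040, 8))  # -> 33
```

In practice a model like this is only a starting point; real planning also accounts for tail latency, failure domains, and traffic growth.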

Who you are

We are looking for SREs who are passionate about enabling AI, have a desire to grow and learn new technologies, and love working in collaborative teams committed to serving their customers. You don’t need to have mastered Machine Learning to join this team!

Your responsibilities include

  • Informing and accelerating GCP and AWS infrastructure adoption best practices (sustaining and improving user onboarding, IAM, image management, Twitter systems integrations, security, et al.)
  • Traditional SRE/operational support scopes such as automation, monitoring, workflow management, GPU cluster management, OS/kernel upgrades, RPM/Python dependency management, bare-metal host management/Puppet manifests, CI/CD, etc.
  • Partnering and supporting existing Cortex Platform teams with Operational guidance and expertise on various project initiatives
  • Creating tools and automation for Operational support and management for DS/ML use cases
  • Supporting various users and developers with operational issues (e.g. “I’m having trouble scheduling GPU jobs with Persistent Volumes”)
  • Capacity Planning and autoscaling.
  • Maintaining version updates of TensorFlow, PyTorch, et al.
  • Partner with Twitter’s Platform and Data Platform organizations to improve, enhance and influence direction and integration opportunities
  • Partner with teams to improve, enhance and integrate with the company’s GCP/AWS Adoption & Management strategy
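The "Capacity Planning and autoscaling" responsibility above often centers on Kubernetes in this stack. As a simplified sketch, the core scaling rule of the Kubernetes Horizontal Pod Autoscaler is desired = ceil(current × currentMetric / targetMetric), with a tolerance band to avoid flapping; the real controller also applies stabilization windows, pod readiness checks, and min/max replica bounds, which this illustration omits.

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         tolerance: float = 0.1) -> int:
    """Simplified Kubernetes HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    skipping the change when the ratio is within the tolerance band
    (0.1 is the controller's default tolerance)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)

# 4 replicas averaging 90% CPU against a 60% target -> scale out to 6
print(hpa_desired_replicas(4, 90.0, 60.0))  # -> 6
```

Understanding this rule helps explain, for example, why an endpoint whose metric hovers just above target may not scale at all (it sits inside the tolerance band).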

Qualifications

  • Experience in either AWS or GCP is required. Experience in both is a plus but not required.
  • Experience with ML is not required. The ML team is happy to train you into this world as needed.
  • Experience with Kubeflow, Scatter-gather systems, and offline systems like Hadoop is a plus, but not a requirement.
  • 4+ years of operating services in a large-scale distributed-systems environment, preferably on GCP (e.g. BigQuery) and/or AWS.
  • Expert knowledge of Linux operating system internals, filesystems, disk/storage technologies, storage protocols, and the networking stack.
  • Expert knowledge of systems programming (bash and shell tools) and practical, validated knowledge of at least one higher-level language (Python, Go, or Scala).
  • Comfortable working with both on-prem and cloud-based infrastructure (AWS, GCP) in terms of deployment, support, monitoring, administration, and troubleshooting.
  • Experience with containerization and orchestration software such as Docker, Kubernetes, or Mesos.
  • A track record of practical problem solving, plus excellent communication and documentation skills.
  • Proven understanding of systems and application design, including the operational trade-offs of various designs.
  • Comfort operating as a member of a team, and an ability to work well with a myriad of personalities at all levels.
  • Solid understanding of distributed systems design, scaling, durability, and security.

Additional Information

A few other things we value:

  • Challenge - We solve some of the industry’s hardest problems. Come to be challenged, learn, and thrive as an engineer.
  • Diversity - Diversity makes us a better organization and team. We value diverse backgrounds, ideas, and experiences.
  • Work-Life Balance - We work hard, but we believe that with hard work should come balance.

We will ensure that individuals with disabilities are provided a reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request an accommodation.
