Collaborate with software engineers and AI researchers to define infrastructure requirements and translate them into effective technical solutions.
Troubleshoot and debug complex issues in production environments, spanning infrastructure, networking, and service layers.
Contribute to improving the scalability, performance, security, and cost-effectiveness of the platform.
Contribute significantly to technical design and architecture discussions, considering reliability, scalability, security, and cost-effectiveness.
Design, implement, and maintain scalable and reliable cloud infrastructure on platforms like AWS, GCP, or Azure.
Implement and improve monitoring, logging, and alerting systems to ensure high availability and performance.
Participate in the on-call rotation to address production issues and ensure system stability.
Automate infrastructure provisioning and configuration using Infrastructure-as-Code tools (e.g., Terraform).
Work closely with teammates and cross-functional partners (Product, Research, other Engineering teams) to ensure seamless delivery and operation of our services.
Share knowledge and document designs and processes effectively; may mentor junior engineers on DevOps practices.
Take ownership of the technical implementation for specific infrastructure components or pipelines.
Develop and manage CI/CD pipelines to automate the build, test, and deployment of our services.
Evaluate and prototype new technologies and frameworks relevant to MLOps, platform infrastructure, and DevOps practices.
Manage containerization and orchestration using technologies like Docker and Kubernetes.
Write high-quality, maintainable, and well-tested code for infrastructure automation and services.
Requirements
Hands-on experience with containerization (Docker) and orchestration (Kubernetes).
Extensive experience building, deploying, and operating services in a cloud environment (AWS, GCP, or Azure).
5+ years of professional DevOps or Site Reliability Engineering (SRE) experience for the Senior level.
Hands-on experience with agentic AI concepts/frameworks (e.g., LangChain, LlamaIndex), vector databases (e.g., Pinecone, Weaviate), or RAG techniques.
Strong problem-solving skills and attention to detail. Good communication and collaboration skills.
Experience with software development best practices (testing frameworks, code reviews).
Experience with CI/CD tools (e.g., GitLab CI/CD, Jenkins, GitHub Actions).
Strong proficiency in one or more scripting or programming languages (e.g., Python, Go, Bash).
Experience with platform or infrastructure-as-a-service components.
Experience with security best practices for cloud infrastructure and applications.
Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Datadog, New Relic).
Solid understanding of Infrastructure-as-Code (IaC) principles and experience with tools like Terraform.
Solid understanding of computer science fundamentals (data structures, algorithms, operating systems, networking).
Interest in or practical experience with AI/ML concepts and MLOps practices.
Experience with building and maintaining developer tools or internal platforms.
Bachelor's degree (or equivalent practical experience) in Computer Science, Engineering, or a related field.