Site Reliability Engineer - Data Platform
- Architect and implement self-service data infrastructure solutions that support the needs of 10+ business units and over 100 engineers and data analysts.
- Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform.
- Collaborate with teams to ensure seamless integration of data-related services with existing systems.
- Develop and maintain bash/shell scripts to automate operational tasks and deployments.
- Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure.
- Enable engineering self-service under tight security requirements using ChatOps and GitOps methodologies.
- Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues.
- Manage user and machine authentication and authorization mechanisms to ensure secure access to data and resources.
- Evangelize and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments.
- Design and deploy MLOps platforms using AWS SageMaker and GitOps methodologies.
- Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC).
- Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information.
- Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration.
- Implement effective incident response procedures and participate in on-call rotations.
- Troubleshoot and resolve incidents promptly to minimize downtime and impact.
- Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions.
- Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement.
- Enable environments for ML experimentation.
- Create and manage MLOps flows for training, validation, and deployment of models.
- Implement efficient, reproducible production deployment of ML models for inference.
Skills you should HODL
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Proven experience (5+ years) working as a Site Reliability Engineer, Infrastructure Engineer, or in a similar role, with a focus on data infrastructure and security.
- Experience with real-time data processing technologies, such as Kafka and Debezium.
- Strong expertise in cloud technologies, particularly AWS (HashiCorp is nice to have).
- Proficiency in Infrastructure as Code tools such as Terraform and Atlantis.
- Experience with containerization and orchestration tools, particularly Kubernetes.
- Solid understanding of bash/shell scripting and proficiency in at least one programming language.
- Familiarity with CI/CD deployment pipelines and related tools.
- Knowledge of HashiCorp products like Vault, Nomad, and Consul is a plus.
- Strong problem-solving skills and the ability to troubleshoot complex systems.
- Expertise in zero-trust architecture and service meshes is a plus.
- Experience with data-related technologies (databases, Airflow, data warehousing, data lakes) is a plus.