Govern and manage large datasets across AWS environments, including versioning data and resolving quality issues for internal users
Own and continuously improve data ingestion and transformation pipelines for large-scale climate and renewables datasets (weather data, renewable generation data), ensuring quality and timely delivery across the business
Facilitate cross-functional data exchange by ingesting and transforming datasets from other engineering teams and delivering them to stakeholders in their required formats
Build and deploy self-service infrastructure components (AWS Lambda functions, Glue/Athena tables, computing infrastructure) that make data access and preparation seamless for data scientists
Build and maintain robust unit and BDD (Behave) test suites that validate complex transformation and modeling logic
Mentor data scientists on code structure, effective testing practices, and engineering standards
Work closely with internal teams and end-users to understand their needs, address technical challenges, and co-design scalable architectural solutions that serve the broader organization
Define and implement comprehensive data quality assurance processes, including validity checks and proactive diagnosis and resolution of production issues
Partner with data scientists to develop new code and scripts, refactoring them into maintainable, efficient, and reusable functions that prevent future bottlenecks
Design and implement optimization strategies for large-scale data processing and complex modeling tasks, using parallelization and distributed computing tools such as Dask to improve performance and efficiency
Create shared code frameworks, templates, and internal libraries that enforce best practices, contribute to company-wide tooling, and accelerate data science workflows
Requirements
AWS
Python
PySpark
CI/CD
Data engineering
Leadership
Strong hands-on experience with AWS services for data processing (Step Functions, Lambda, Batch, S3, Athena)
Deep knowledge of software engineering best practices, including design patterns, refactoring, infrastructure as code, containerization, and CI/CD pipelines