Data engineering in Python
16-09-2024
General tools
- compute
  - Databricks (which uses Apache Spark as its processing engine and allows for in-memory caching and optimised query execution) and Delta Lake (an open-source storage layer that adds ACID transactions on top of a data lake, so it can double as a warehouse); a PySpark sketch follows after this list
 
- storage
  - relational (MSSQL, PostgreSQL)
  - non-relational (MongoDB, Redis)
  - object (Azure Blob Storage / Amazon S3; boto3 sketch after this list)
 
- containerisation
  - Docker (automates and manages the deployment of applications inside containers)
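
A minimal sketch of the Databricks + Delta Lake combination above, written with PySpark. It assumes a Spark session with Delta Lake support configured (on Databricks, `spark` already exists); the table path is illustrative.

```python
from pyspark.sql import SparkSession

# On Databricks a session is provided; elsewhere this needs delta-spark configured.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice"), (2, "bob")],
    schema="id INT, name STRING",
)

# Delta adds ACID writes and time travel on top of plain file/object storage.
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

users = spark.read.format("delta").load("/tmp/delta/users")
users.show()
```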
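
Similarly, a minimal object-storage sketch using boto3 against Amazon S3; the bucket and key names are hypothetical, and credentials are assumed to come from the environment or `~/.aws/credentials`.

```python
import boto3

s3 = boto3.client("s3")

# Object stores address blobs by bucket + key rather than tables and rows.
s3.put_object(Bucket="my-data-lake", Key="raw/events.json", Body=b'{"id": 1}')

obj = s3.get_object(Bucket="my-data-lake", Key="raw/events.json")
print(obj["Body"].read())
```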
 
 
Python tools
- Airflow (task-centric, great open-source support) and Dagster (asset-centric, great testing and debugging support) for orchestration; a minimal Airflow sketch follows after this list
- Polars and Pandas for data processing (Polars is becoming more mainstream; it is written in Rust and built on Apache Arrow; sketch below)
- dbt for SQL-based data transformations
- FastAPI for building APIs (a very popular library, built on top of Pydantic, which handles data parsing and validation; sketch below)
- SQLAlchemy for database connections (the 2.0 major release reworked the API, unifying Core and ORM querying; sketch below)
- Poetry / uv for project, package and dependency management
- Ansible for infrastructure management and configuration
- OpenTelemetry for monitoring (open source, language-agnostic, can capture rich metadata; tracing sketch below)
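
A minimal orchestration sketch using Airflow's TaskFlow API (Airflow 2.x); the DAG name and task bodies are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_pipeline():
    @task
    def extract() -> list[int]:
        # Placeholder for pulling rows from a source system.
        return [1, 2, 3]

    @task
    def load(rows: list[int]) -> None:
        # Placeholder for writing rows to a destination.
        print(f"loading {len(rows)} rows")

    # Task dependencies are inferred from the data flow.
    load(extract())


etl_pipeline()
```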
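
A minimal Polars sketch; the lazy API lets Polars optimise the whole query plan before executing it. The data is made up.

```python
import polars as pl

df = pl.DataFrame({"city": ["A", "A", "B"], "temp": [21.0, 23.5, 18.0]})

# .lazy() defers execution so the query optimiser sees the full plan.
result = (
    df.lazy()
    .group_by("city")
    .agg(pl.col("temp").mean().alias("avg_temp"))
    .collect()
)
print(result)
```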
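
A minimal FastAPI sketch; Pydantic parses and validates the JSON body before the handler runs. The model and route are illustrative (run with `uvicorn main:app`, filename assumed).

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Item(BaseModel):
    name: str
    price: float


@app.post("/items/")
def create_item(item: Item) -> dict:
    # By this point the body has already been validated against Item;
    # invalid requests get a 422 response automatically.
    return {"name": item.name, "price": item.price}
```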
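
A minimal sketch of the SQLAlchemy 2.0 style (DeclarativeBase models, select() for both Core and ORM querying); it uses an in-memory SQLite database so it runs as-is.

```python
from sqlalchemy import String, create_engine, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(50))


engine = create_engine("sqlite+pysqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(User(name="alice"))
    session.commit()
    # 2.0 unifies querying around select() instead of the legacy Query API.
    for user in session.scalars(select(User)):
        print(user.name)
```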
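
A minimal tracing sketch with the OpenTelemetry Python SDK, exporting spans to the console; the tracer name, span name, and attribute are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

# Wire up a provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("etl.pipeline")

with tracer.start_as_current_span("load_batch") as span:
    # Rich metadata is attached as span attributes.
    span.set_attribute("rows.count", 1000)
```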