Gibson Wasukira

Data Engineering | Data Architecture | Solutions Architecture


BUILDING A SCALABLE COST OPTIMIZED RESILIENT DATA PLATFORM

Sep 06, 2022


For any business running cloud-native infrastructure, infrastructure costs increase as the business grows, so cost efficiency is always paramount.

Data engineering requires a high level of autonomy, so as a data engineer you must have a firm grasp of the infrastructure on which your data pipelines run.

You should be able to design, build and operationalize a data platform, then secure and monitor the data processing systems running on it. For this reason, a "You Build It, You Run It!" approach is always encouraged for data engineers, especially in startups.

You should be able to troubleshoot issues within a pipeline without relying on a third party. The DevOps or platform engineer might be too busy to consider the intricate details of the "Spark Shuffle" process when setting up the platform or making infrastructure decisions, yet those decisions have a direct influence on your pipeline performance. The effect could be transient, or it could manifest later.

If you run this infrastructure as a data engineer, one of your KPIs, from a finance perspective, will be to keep these costs to a minimum while still meeting the SLAs set by the business.

So how can we build a cost optimized but resilient data platform?

In this series of posts, I discuss the data architecture of a scalable, cost-optimized, resilient data platform: the tooling (toys), the reasons for choosing it, the alternatives to the chosen tooling, and the different personas in the ecosystem.

The entire platform infrastructure is managed as Infrastructure as Code (IaC), with Terraform as the tool of choice, and the complete project will be shared in a GitHub repo.

A cloud-agnostic platform was paramount to this project to alleviate cloud lock-in concerns, but cloud-native services were used where necessary.

The overall architecture of the project is available to view or download (link).



DEPLOYING A COST OPTIMIZED RESILIENT DATA PLATFORM SERIES

Part 1 - Building Blocks

Part 2 - Architecture Requirements Gathering

Part 3 - Architecture Design
