Gibson Wasukira

Data Engineering | Data Architecture | Solutions Architecture


PART 2 - ARCHITECTURE REQUIREMENTS GATHERING


In this second part of the series, we shall gather the information that will drive the architectural design of the data platform. This is the genesis of our data platform. At the end of this phase, we should have a good idea of what kind of data pipeline we will build.

Our primary role as data engineers is to build data pipelines that deliver data to end users.

Before we start building our pipeline, we need answers to the following questions:

Determine the Stakeholders

Let’s say our stakeholders are:

What data needs to be consumed?

What governance policies should be in place?

Remember to always follow the Principle of Least Privilege: grant each user only the access they need.

What compliance requirements apply?

What is the budget?
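The least-privilege note above can be sketched in code. This is a minimal, hypothetical helper (the role and table names are assumptions) that builds GRANT statements giving a role only the privileges it actually needs, rather than a blanket GRANT ALL:

```python
# Hypothetical sketch of the Principle of Least Privilege: build GRANT
# statements that give each role only the listed privileges per table.
# Role, schema, and table names here are illustrative assumptions.

def least_privilege_grants(role: str, tables: dict) -> list:
    """Return one GRANT statement per table, scoped to the given privileges."""
    return [
        f"GRANT {', '.join(privs)} ON {table} TO {role};"
        for table, privs in tables.items()
    ]

# An analyst role gets read-only access to one table, and nothing else.
print(least_privilege_grants("analyst", {"sales.orders": ["SELECT"]}))
# → ['GRANT SELECT ON sales.orders TO analyst;']
```

The statements would then be reviewed and applied by a database administrator; the point is that access is enumerated explicitly, never defaulted to everything.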

Data Sources – What systems are used to capture the data?

With the above information, we then move on to identifying the systems from which to extract this data.

In data engineering terminology, operational/transactional data is referred to as System of Record (SOR) data.

So, the SOR data sources will be:

So, we have both relational and nonrelational models. Our ERP, CRM, and BSS systems are backed by relational data stores, namely PostgreSQL and MySQL.
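To make the relational model concrete, here is a minimal sketch using SQLite as a stand-in for our PostgreSQL/MySQL stores (the `customers`/`orders` schema is an illustrative assumption, not the actual ERP schema). Relationships live in foreign keys and are resolved with joins:

```python
# Minimal sketch of the relational model behind the ERP/CRM stores,
# using in-memory SQLite as a stand-in for PostgreSQL/MySQL.
# Table and column names are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Acme Ltd');
    INSERT INTO orders VALUES (100, 1, 250.0);
""")

# The customer-order relationship is stored as a foreign key and
# reassembled at query time with a JOIN.
row = conn.execute("""
    SELECT c.name, o.amount
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchone()
print(row)  # → ('Acme Ltd', 250.0)
```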

The nonrelational “NoSQL” data stores fall into two categories here: document databases and graph databases. In document databases, data is represented as self-contained documents; relationships between one document and another are rare.
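Contrast this with the relational model above: in a document store, the related records are embedded inside the document itself, so one lookup returns everything. A minimal sketch (field names are illustrative assumptions):

```python
# Sketch of a self-contained document, as stored in a document database.
# The customer's orders are embedded in the same document rather than
# referenced from another table. Field names are illustrative assumptions.
import json

customer_doc = {
    "_id": "cust-1",
    "name": "Acme Ltd",
    "orders": [                      # embedded, not joined
        {"order_id": 100, "amount": 250.0},
    ],
}

# Everything needed to serve this customer is read in a single lookup;
# no join against a separate orders collection is required.
print(json.dumps(customer_doc, indent=2))
```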

Graph databases will be used later in this series; they suit applications where almost everything is related to everything else.
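A toy adjacency list conveys the idea behind graph databases: entities are nodes, relationships are first-class edges, and traversing those edges is the primary operation. The node and relationship names below are illustrative assumptions:

```python
# Toy sketch of the graph model: nodes connected by named relationships,
# stored as an adjacency list. Names are illustrative assumptions.
from collections import defaultdict

edges = [
    ("alice", "KNOWS", "bob"),
    ("bob", "WORKS_AT", "acme"),
    ("alice", "BUYS_FROM", "acme"),
]

graph = defaultdict(list)
for src, rel, dst in edges:
    graph[src].append((rel, dst))

# Traversal: everything directly connected to alice, with the
# relationship type preserved.
print(graph["alice"])  # → [('KNOWS', 'bob'), ('BUYS_FROM', 'acme')]
```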

Therefore, our data sources are:

With the above information, we are ready to start building our architecture.



DEPLOYING A COST OPTIMIZED RESILIENT DATA PLATFORM SERIES

Building a Scalable Cost Optimized Resilient Data Platform

Part 1 - Building Blocks

Part 3 - Architecture Design
