Gibson Wasukira

Data Engineering | Data Architecture | Solutions Architecture


PART 1 - BUILDING BLOCKS


By definition, a data platform is a complete solution for ingesting, processing, analyzing, and presenting the data generated by systems, processes, and infrastructure.

The following building blocks will be used during the data platform architecture design. Below is a brief overview of each:

Infrastructure

The services require infrastructure on which to run. Cloud service offerings range from IaaS and PaaS to SaaS; our focus will be IaaS and PaaS.

We will work under the assumption that we are an AWS shop, so our infrastructure will be built on AWS. However, our data pipelines shouldn't be tightly coupled to the cloud provider's service offerings; they should be easily portable to a different cloud provider.

We should be able to provision and allocate the right amount of resources to execute tasks, and continuously monitor resource usage.

We should manage our infrastructure using Infrastructure as Code (IaC) and version control, as this ensures idempotency and keeps track of changes to the infrastructure. It also helps with prototyping, since it enables you to spin up an environment at a moment's notice, isolating development and production environments.

Observability

We need an operational view of our pipelines and the data platform's infrastructure resources. This can be delivered through dashboards and alerting, covering both expected events and any anomalies.

We should also be able to track the status of jobs within our pipelines and, should failures occur, identify their causes.

Pipeline metadata, such as data freshness, could also be published to these observability dashboards.

We will put in place a centralized logging mechanism for observability
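One way to feed such a centralized logging mechanism is to emit structured (JSON) logs that a log aggregator can parse. The sketch below uses Python's standard `logging` module; the `pipeline` and `task` field names are illustrative, not a fixed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as JSON lines for a central log aggregator."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Pipeline context is attached per-call via `extra=`.
            "pipeline": getattr(record, "pipeline", None),
            "task": getattr(record, "task", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("data_platform")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("task finished", extra={"pipeline": "daily_sales", "task": "load"})
```

Because every line is self-describing JSON, the same records can drive dashboards and alerting without bespoke parsing per pipeline.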

Secrets management

We need to securely store our passwords and other credentials, and rotate them without causing significant downtime.
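Since we are assuming an AWS shop, a natural fit is AWS Secrets Manager. A minimal sketch of the fetch side, assuming a boto3 Secrets Manager client (e.g. `boto3.client("secretsmanager")`) and a JSON-encoded secret; the secret name is illustrative:

```python
import json

def get_secret(client, secret_id):
    """Fetch and parse a JSON secret from AWS Secrets Manager.

    Fetching at task start, rather than caching for the life of the
    process, means rotated values are picked up without a redeploy.
    """
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

# creds = get_secret(boto3.client("secretsmanager"), "prod/warehouse")
```

Passing the client in, rather than constructing it inside the function, also keeps the helper testable without live AWS credentials.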

Orchestration and Task Registration

Analytics workloads (jobs) are collections of long-running tasks and therefore need to be orchestrated by a state machine. We need to identify the order of execution, then have the orchestration service build a data pipeline graph.

The orchestration service will initiate and track the execution of tasks; it will also serve as our task registry.
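At its core, deriving the order of execution from a dependency graph is a topological sort. A minimal sketch using Python's standard-library `graphlib`; the task names and dependencies are a hypothetical pipeline, not part of our design:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "publish_dashboard": {"join_orders_customers"},
}

# A valid execution order that respects every dependency.
execution_order = list(TopologicalSorter(pipeline).static_order())
```

A real orchestrator (Airflow, Step Functions, and the like) layers scheduling, retries, and state tracking on top, but the ordering problem it solves is exactly this one.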

Data Lineage

During transformations, tasks will have to declare their inputs and outputs; this metadata will be used by the orchestration service to draw up the DAG.

The lineage graph represents the relationships between a collection of tasks, which helps with traceability of dependencies.
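The lineage edges themselves can be derived mechanically: if one task's output dataset is another task's input, there is an edge between them. A sketch of that derivation, with illustrative task and dataset names:

```python
# Each task declares the datasets it reads and writes.
tasks = {
    "clean_orders": {"inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    "clean_customers": {"inputs": ["raw.customers"], "outputs": ["staging.customers"]},
    "orders_report": {
        "inputs": ["staging.orders", "staging.customers"],
        "outputs": ["marts.orders_report"],
    },
}

def lineage_edges(tasks):
    """Derive (producer, consumer) edges by matching outputs to inputs."""
    producers = {out: name for name, t in tasks.items() for out in t["outputs"]}
    edges = set()
    for name, t in tasks.items():
        for dataset in t["inputs"]:
            if dataset in producers:
                edges.add((producers[dataset], name))
    return edges
```

The same declared metadata thus serves double duty: it drives the orchestration DAG and answers lineage questions ("what breaks downstream if this task fails?").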

Data Certification

We shall have to embed data quality checks between tasks to ensure only good-quality data reaches dashboards and the subscription service.
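Conceptually, such a check is a gate: run a set of named predicates over a batch, and let downstream tasks proceed only if all of them pass. A minimal sketch with made-up checks and sample rows:

```python
def certify(rows, checks):
    """Run each named check over the batch; return the names that failed."""
    return [name for name, check in checks.items() if not check(rows)]

rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 25.5},
]

checks = {
    "non_empty": lambda rs: len(rs) > 0,
    "no_null_ids": lambda rs: all(r["order_id"] is not None for r in rs),
    "positive_amounts": lambda rs: all(r["amount"] > 0 for r in rs),
}

failures = certify(rows, checks)
# The downstream task runs only when `failures` is empty;
# otherwise the orchestrator halts the branch and alerts.
```

Tools such as Great Expectations or dbt tests offer the same gate with richer reporting, but the contract between tasks is the same pass/fail decision.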

Data Discovery

To foster data democratization and eliminate tribal knowledge within our organization, we will need a single searchable portal where data users can get answers to questions such as: which datasets exist, where they live, who owns them, and how fresh they are.

Data Subscription Service

We shall have to make it possible to integrate with external services such as dashboards, reporting engines, and REST APIs. These consumers should be able to subscribe to the data subscription service.

Query Service

Upon ingestion of data into the storage layer, it will be accessible through a query service. Transformation tasks will use this service to fetch data from, and persist data back into, the storage layer.

This is a layer over a SQL query engine or data warehouse; it should be able to scale depending on the complexity and concurrency of the queries.
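The interface transformation tasks depend on is narrow: submit SQL, get rows back. The sketch below uses in-memory SQLite purely as a stand-in for the real engine (Athena, Presto, a warehouse, etc.); the table and wrapper are illustrative:

```python
import sqlite3

# SQLite stands in for the actual query engine behind the service.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

def run_query(sql, params=()):
    """The query-service contract: parameterized SQL in, rows out."""
    return conn.execute(sql, params).fetchall()

total = run_query("SELECT SUM(amount) FROM orders")[0][0]
```

Keeping tasks coded against this thin contract, rather than engine-specific APIs, is what makes the pipelines portable across providers, as argued earlier.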

Data Governance

We should be able to define authentication and fine-grained access controls on the data resident in our data stores. These can be implemented within the query service, and should enforce authorization mechanisms that limit what users can access.
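One common shape for fine-grained control is column-level grants enforced at query time. A minimal sketch; the roles, grants, and column names are illustrative, not a proposed policy:

```python
# Each role maps to the set of columns it may read.
GRANTS = {
    "analyst": {"order_id", "amount"},
    "support": {"order_id"},
}

def authorize(role, columns):
    """Raise if the role requests any column it is not granted."""
    allowed = GRANTS.get(role, set())
    denied = set(columns) - allowed
    if denied:
        raise PermissionError(f"{role} may not read: {sorted(denied)}")
    return list(columns)

def filter_row(role, row):
    """Alternative enforcement: project rows down to granted columns."""
    allowed = GRANTS.get(role, set())
    return {col: val for col, val in row.items() if col in allowed}
```

Hosting this check inside the query service means every consumer, dashboards and subscription clients alike, passes through the same authorization point.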

Data Transformation

When creating transformation tasks, we should be able to generate metadata that includes upstream dependencies. These dependencies are used by our orchestration tool to dynamically generate our transformation DAGs and lineage.

Transformations can be either stream processing or batch processing depending on the need. Our architecture covers both; this kind of architecture is called a Lambda Architecture.

Security

Security considerations will cover data both in transit and at rest.



DEPLOYING A COST OPTIMIZED RESILIENT DATA PLATFORM SERIES

Building a Scalable Cost Optimized Resilient Data Platform

Part 3 - Architecture Design

Part 2 - Architecture Requirements Gathering
