📝 DOCUMENTATION
I. Design
1. Project Overview
This project simulates a data system for a coffee shop chain with two primary goals:
- Real-time product suggestions based on new customer orders
- Daily batch processing for analytics-ready data
It is designed to strike a balance between low-latency business needs and scalable analytical workloads, using a hybrid architecture that includes both streaming and batch components.
2. Business Requirements
Real-Time: Product Suggestion
When a new order is placed, the system should evaluate whether it qualifies for a product suggestion. The following conditions must be met:
- The payment method is bank transfer, and the bank is ACB
- The customer is a diamond-tier member
If all conditions are met, a product with a special discount is suggested — different from those already ordered. We will assume that all product suggestions are accepted by customers.
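To make the rule concrete, here is a minimal sketch of the eligibility check in plain Python. The function name, data shapes, and lookup structures are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the suggestion rule; names and data shapes are illustrative.
def pick_suggestion(order, diamond_customer_ids, acb_transfer_method_ids, catalog):
    """Return a product id to suggest with a special discount, or None."""
    is_acb_transfer = order["payment_method_id"] in acb_transfer_method_ids
    is_diamond = order["customer_id"] in diamond_customer_ids
    if not (is_acb_transfer and is_diamond):
        return None

    # Suggest a product different from those already in the order.
    already_ordered = {line["product_id"] for line in order["items"]}
    candidates = [p for p in catalog if p not in already_ordered]
    return candidates[0] if candidates else None
```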
Batch: Daily Analytics Preparation
Each day, raw data is processed to support reporting and dashboarding. This includes:
- Ingest: Load new data from MySQL into the lakehouse
- Clean: Validate and transform the data
- Store: Persist structured, queryable datasets for downstream tools
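As a rough illustration of the Ingest step, the sketch below pulls newly updated rows from MySQL into the bronze layer with PySpark. Hosts, credentials, paths, and the watermark value are assumptions for illustration only.

```python
# Hedged sketch of the daily ingest: MySQL -> bronze. Hosts, credentials, paths,
# and the watermark value are illustrative assumptions. Assumes the MySQL JDBC
# driver and Delta Lake are available on the Spark classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("bronze_ingest")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

last_loaded = "2024-01-01 00:00:00"  # would normally come from pipeline state

new_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/coffee_shop")
    .option("dbtable", f"(SELECT * FROM orders WHERE timestamp > '{last_loaded}') AS t")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

new_orders.write.format("delta").mode("append").save("s3a://lakehouse/bronze/orders")
```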
3. Data Source
The data is divided into several CSV files plus two auto-generated tables, used as follows:
product.csv: A list of products sold in the coffee shop.
- id: Unique ID of the product
- name: Name of the product
- category_id: ID of the category this product belongs to
- unit_price: The price of the product
- updated_at: Timestamp of the last update to this product’s data
product_category.csv: The category of the product.
- id: Unique ID for each product category
- name: Name of the category
- updated_at: Timestamp of the last update to this category’s data
store.csv: The branches of the coffee shop.
- id: Unique ID for each coffee shop location
- name: Name of the coffee shop branch
- address: The address of the coffee shop
- district: The district where the branch is located
- city: The city where the branch is located.
- updated_at: Timestamp of the last update to this store’s data.
payment_method.csv: The payment methods supported by the coffee shop.
- id: Unique ID for each payment method
- method_name: The name of the payment method (e.g., Cash, Credit Card, etc.)
- bank: Bank associated with the method (used for bank transfer only).
- updated_at: The date when the payment method data was last updated.
diamond_customers.csv: List of diamond-tier customers.
- id: Unique identifier for the customer.
- name: The name of the customer.
- phone_number: Customer’s phone number.
- updated_at: Timestamp of the last update to this diamond customer’s data.
orders: This data is generated automatically by a Python script and loaded into MySQL. It includes detailed information about each order.
- id: Unique ID for each order.
- timestamp: Date and time when the order was placed.
- store_id: The ID of the store where the order took place.
- payment_method_id: The payment method used for the order.
- customer_id: ID of the customer who placed the order.
- status: The status of the order (pending or completed).
order_details: Auto-generated data representing products in each order (loaded into MySQL).
- order_id: Identifier linking to the corresponding order.
- product_id: ID of the purchased product.
- quantity: Quantity of the product purchased.
- discount_percent: Discount applied to this product (used in business logic).
- subtotal: Total price for this product (unit_price × quantity).
- is_suggestion: Indicates whether this product was suggested (used in business logic).
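For context, the order generator could look something like the sketch below, which inserts one order and one order line into MySQL. Connection details, ID ranges, and the literal values are assumptions based on the fields listed above.

```python
# Hedged sketch of a Python generator for orders and order_details.
# Connection details and the generated values are illustrative assumptions.
import random
import uuid
from datetime import datetime

import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="coffee_shop"
)
cur = conn.cursor()

order_id = str(uuid.uuid4())
cur.execute(
    "INSERT INTO orders (id, timestamp, store_id, payment_method_id, customer_id, status) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    (order_id, datetime.now(), random.randint(1, 5), random.randint(1, 4),
     random.randint(1, 100), "pending"),
)

unit_price, quantity = 45000, 2  # assumed price of the chosen product
cur.execute(
    "INSERT INTO order_details (order_id, product_id, quantity, discount_percent, subtotal, is_suggestion) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    (order_id, random.randint(1, 20), quantity, 0, unit_price * quantity, False),
)

conn.commit()
cur.close()
conn.close()
```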
4. Architecture at a Glance
The system separates responsibilities into two pipelines:
- Real-time pipeline: Handles event-driven product suggestion logic upon new orders
- Batch pipeline: Periodically transforms and stores data into the Lakehouse for analytics
5. MySQL: Transactional Landing Zone
MySQL is used as the initial data store for real-time order generation, chosen for its:
- Compatibility with CDC tools like Kafka Connect
- Support for ACID transactions
- Easy integration with Python-based data generators
This layer includes core tables such as orders and order_details, and reference tables like products, stores, payment_methods, etc.
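Because CDC compatibility drives this choice, a change-data-capture source connector has to be registered with Kafka Connect. The sketch below assumes the Debezium MySQL connector (a common choice, not confirmed by this project); hostnames, credentials, and topic settings are illustrative.

```python
# Hedged sketch of registering a CDC source connector via Kafka Connect's REST
# API. Assumes Debezium MySQL; all hostnames, credentials, and names are
# illustrative assumptions.
import requests

connector = {
    "name": "coffee-shop-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "topic.prefix": "mysql",
        "database.include.list": "coffee_shop",
        "table.include.list": "coffee_shop.orders,coffee_shop.order_details",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.coffee_shop",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```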
The ER diagram below illustrates the relationships between the core tables:
6. Lakehouse Architecture
In the lakehouse, we structure data using the Medallion Architecture:
- Bronze: raw data ingested from source systems
- Silver: cleaned, transformed and joined data
- Gold: aggregated and business-consumable data for dashboarding and analytics
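As an illustration of the Bronze → Silver step, the sketch below deduplicates, validates, and conforms the raw product data and joins it with its category. Paths, column handling, and the cleaning rules are assumptions.

```python
# Hedged sketch of a bronze -> silver cleaning step for products; paths and
# rules are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("silver_products")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

products = spark.read.format("delta").load("s3a://lakehouse/bronze/products")
categories = spark.read.format("delta").load("s3a://lakehouse/bronze/product_category")

silver_products = (
    products.dropDuplicates(["id"])
    .filter(F.col("unit_price") > 0)  # basic validation
    .join(
        categories.select(F.col("id").alias("category_id"), F.col("name").alias("category")),
        on="category_id",
        how="left",
    )
    .select(
        F.col("id").alias("product_id"),
        F.col("name").alias("product_name"),
        "category",
        F.col("unit_price").cast("int").alias("unit_price"),
    )
)

silver_products.write.format("delta").mode("overwrite").save("s3a://lakehouse/silver/products")
```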
Data Model in Gold Layer
Gold Layer Data Catalog
gld.dim_products
Provides information about the products and their attributes.
Column Name | Data Type | Description |
---|---|---|
product_key | int | Surrogate key for the product (unique per version for SCD2 tracking). |
product_id | string | Natural/business key from the original product source. |
product_name | string | Name of the product. |
category | string | Category that the product belongs to. |
unit_price | int | Price per unit of the product. |
start_date | date | Date when this version of the product record became valid. |
end_date | date | Date when this version stopped being valid (null if current). |
is_current | boolean | Indicates whether this is the current version of the product (true/false). |
gld.dim_stores
Provides information about store locations and their attributes.
Column Name | Data Type | Description |
---|---|---|
store_key | int | Surrogate key for the store (unique per version for SCD2 tracking). |
store_id | int | Natural/business key from the original store source. |
address | string | Full address of the store. |
district | string | District where the store is located. |
city | string | City where the store is located. |
start_date | date | Date when this version of the store record became valid. |
end_date | date | Date when this version stopped being valid (null if current). |
is_current | boolean | Indicates whether this is the current version of the store (true/false). |
gld.fact_orders
Captures detailed coffee shop order transactions for reporting and analysis.
Column Name | Data Type | Description |
---|---|---|
order_date | date | The date when the order was placed. |
order_id | string | Unique identifier for each order. |
customer_id | int | Identifier for the customer who placed the order. |
store_key | int | Foreign key referencing the store dimension. |
payment_method_key | int | Foreign key referencing the payment method used. |
product_key | int | Foreign key referencing the purchased product. |
quantity | int | Number of units of the product ordered. |
subtotal | int | Total cost for the product line. |
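To show how the gold tables are meant to be consumed, here is an illustrative query (daily revenue per product). It assumes the tables are registered in the metastore under the gld schema of a Delta-enabled Spark session.

```python
# Illustrative analytics query over the gold layer; assumes gld.* tables are
# registered in the metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_reporting").enableHiveSupport().getOrCreate()

daily_revenue = spark.sql("""
    SELECT f.order_date,
           p.product_name,
           SUM(f.subtotal) AS revenue
    FROM gld.fact_orders f
    JOIN gld.dim_products p
      ON f.product_key = p.product_key
    GROUP BY f.order_date, p.product_name
    ORDER BY f.order_date, revenue DESC
""")
daily_revenue.show()
```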
7. Data Pipeline Design
Real-time Flow
Order events are captured from MySQL via Kafka Connect. Kafka consumers then check the following conditions to determine whether a product suggestion should be made:
- Whether the customer is a diamond-tier member
- Whether the payment method is bank transfer
- Whether the associated bank is ACB
If all conditions are satisfied, the system suggests a product different from those already ordered, and the suggestion is published to the order_suggestion_accepted topic (see the consumer sketch below).
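A consumer along these lines could implement the checks. The CDC topic name, event shape, Redis key layout, and the discount value are assumptions; only the order_suggestion_accepted topic name comes from the design above.

```python
# Hedged sketch of the suggestion consumer. Topic names other than
# order_suggestion_accepted, the event shape, Redis keys, and the discount
# value are illustrative assumptions.
import json

import redis
from kafka import KafkaConsumer, KafkaProducer

r = redis.Redis(host="redis", port=6379, decode_responses=True)

consumer = KafkaConsumer(
    "mysql.coffee_shop.orders",  # assumed CDC topic produced by Kafka Connect
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    order = message.value  # assumed to carry the new order row
    if not r.sismember("diamond_customers", order["customer_id"]):
        continue
    if not r.sismember("acb_bank_transfer_method_ids", order["payment_method_id"]):
        continue

    # Suggest a product different from those already in the order
    # (hypothetical cache of the order's product ids).
    ordered = r.smembers(f"order_products:{order['id']}")
    candidates = [pid for pid in r.hkeys("products") if pid not in ordered]
    if candidates:
        producer.send(
            "order_suggestion_accepted",
            {"order_id": order["id"], "product_id": candidates[0], "discount_percent": 20},
        )
```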
Caching Strategy for Real-time Processing
Since condition checks are performed continuously in real time, lookup tables are cached in Redis to optimize performance:
- products: List of all available products
- payment_methods: Only the ID of the bank-transfer method whose bank is ACB
- diamond_customers: List of all customers with diamond-tier membership
- order_info: Mapping of order_id to customer_id and payment_method_id for fast access
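As an illustration, the lookups above might be warmed into Redis like this; the key names and example values are assumptions.

```python
# Hedged sketch of warming the Redis lookup caches; keys and values are
# illustrative assumptions.
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

# products: id -> name (hash)
r.hset("products", mapping={"1": "Espresso", "2": "Latte", "3": "Cold Brew"})

# payment_methods: only the bank-transfer method id(s) whose bank is ACB (set)
r.sadd("acb_bank_transfer_method_ids", "3")

# diamond_customers: ids of diamond-tier members (set)
r.sadd("diamond_customers", "101", "102", "103")

# order_info: order_id -> customer_id / payment_method_id (hash per order)
r.hset("order_info:ORD-1", mapping={"customer_id": "101", "payment_method_id": "3"})
```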
Batch Flow
Data is loaded into the Lakehouse following the Medallion Architecture.
- Incremental Load is applied to optimize data processing.
- Slowly Changing Dimension (SCD Type 2) is implemented for dimension tables to preserve historical changes (see the sketch below).
- Airflow orchestrates daily ETL jobs with clear dependencies.
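Here is a hedged sketch of how the SCD Type 2 logic for gld.dim_products could be expressed with a Delta MERGE. Table paths, the change-detection columns, and the surrogate-key handling are assumptions.

```python
# Hedged SCD2 sketch for gld.dim_products with Delta Lake. Paths, the change
# condition, and surrogate-key handling are illustrative assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("scd2_dim_products")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

dim_path = "s3a://lakehouse/gold/dim_products"
updates = spark.read.format("delta").load("s3a://lakehouse/silver/products")  # conformed columns

# Step 1: close the current version of every product whose attributes changed.
dim = DeltaTable.forPath(spark, dim_path)
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.product_id = s.product_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.unit_price <> s.unit_price OR t.category <> s.category",
        set={"end_date": "current_date()", "is_current": "false"},
    )
    .execute()
)

# Step 2: insert a fresh current row for changed and brand-new products
# (product_key generation is omitted here).
current = spark.read.format("delta").load(dim_path).filter("is_current = true")
to_insert = (
    updates.join(current, on="product_id", how="left_anti")
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)
to_insert.write.format("delta").mode("append").save(dim_path)
```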
Technology Choices: Why They Matter
To support both real-time responsiveness and scalable analytics, this project combines various technologies, each chosen for specific reasons:
- Kafka is adopted due to the event-driven nature of the product suggestion logic, which must react instantly as new orders arrive. Kafka enables this by:
- Supporting Change Data Capture via Kafka Connect for syncing new orders in near real-time.
- Offering parallel processing through partitioned topics, allowing multiple consumers to handle multiple orders in parallel.
- Redis is used to cache lookup data (e.g., products, diamond-tier customers, payment methods) to reduce excessive database access and enable fast condition evaluation. This avoids redundant queries and supports low-latency processing at scale.
- Apache Spark is chosen to build the Lakehouse architecture:
- Handles both structured and semi-structured data.
- Supports efficient storage formats like Parquet, which reduce storage costs via compression and columnar encoding.
- Integrates with Delta format, which adds ACID compliance, time travel, and incremental processing capabilities — all essential for building reliable, maintainable batch pipelines and Slowly Changing Dimensions (SCD Type 2).
- Airflow is used to orchestrate repetitive batch jobs, removing the need for manual intervention.
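For illustration, a daily DAG wiring the batch steps together might look like the sketch below; the DAG id, schedule, and job script paths are assumptions.

```python
# Hedged sketch of the daily orchestration DAG; ids, schedule, and the
# spark-submit script paths are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_lakehouse_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_bronze = BashOperator(
        task_id="ingest_bronze",
        bash_command="spark-submit /opt/jobs/ingest_bronze.py",
    )
    clean_silver = BashOperator(
        task_id="clean_silver",
        bash_command="spark-submit /opt/jobs/clean_silver.py",
    )
    build_gold = BashOperator(
        task_id="build_gold",
        bash_command="spark-submit /opt/jobs/build_gold.py",
    )

    ingest_bronze >> clean_silver >> build_gold
```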