Data Architecture Patterns

Common patterns for implementing data and analytics use cases

Reference Architecture Patterns

These patterns provide templates for implementing various data and analytics use cases based on the AWS architecture tiers. Each pattern addresses a specific set of requirements and can be customized to fit your needs.

Select a pattern that aligns with your use case's data platform tier requirement, business value, and complexity to accelerate implementation.

Pattern 1: Basic Data Collection

Tier 1 | Low Complexity | Batch Processing
Architecture flow: Data Sources → S3 Buckets → Applications

Description

This fundamental pattern focuses on collecting and storing data from various sources in a centralized S3 data lake. It provides a simple foundation for basic data access and serves as a starting point for more advanced patterns.

Key Components

  • Data Sources: Application databases, CSV files, JSON data, external APIs
  • Amazon S3: Central storage with basic folder structure
  • Basic IAM: Simple access controls for data producers and consumers

Applicable Use Cases

  • Automated First Notice of Loss
  • RPA for Claims Data Entry
  • Proactive Policy Renewal Communications
  • Personalized Policy Chatbots

Implementation Considerations

  • Define a consistent folder structure in S3 to organize data by source and date (see the sketch after this list)
  • Implement basic file validation to ensure data quality
  • Define appropriate IAM roles and policies for secure access
  • Document data dictionary and metadata
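
As a minimal illustration of the first two considerations, the following Python sketch (using boto3) validates a small batch of records and lands it under a source/dataset/date prefix. The bucket name, key layout, and `record_id` field are hypothetical placeholders, not a prescribed standard.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical layout: s3://<bucket>/<source>/<dataset>/year=YYYY/month=MM/day=DD/<file>
BUCKET = "example-insurance-data-lake"  # placeholder bucket name


def upload_record_batch(source: str, dataset: str, records: list) -> str:
    """Apply basic validation and land a batch of records under a date-based prefix."""
    # Minimal file validation: reject empty batches and records missing an id field
    if not records or any("record_id" not in r for r in records):
        raise ValueError("batch failed basic validation")

    now = datetime.now(timezone.utc)
    key = (
        f"{source}/{dataset}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{dataset}-{now:%H%M%S}.json"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return key


# Example: upload_record_batch("claims-app", "fnol-events", [{"record_id": "123"}])
```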

AWS Services

  • Amazon S3 (Core)
  • AWS IAM (Core)
  • AWS Lambda (Optional)
  • Amazon CloudWatch (Optional)

Pros & Cons

Pros
  • Simple and quick to implement
  • Low cost with minimal maintenance
  • Centralized storage for various data sources
  • Foundation for future data platform growth
Cons
  • Limited analytics capabilities
  • Minimal data governance
  • No built-in data catalog or discovery
  • Limited scalability for complex use cases

Pattern 2: Data Lake with SQL Query

Tier 2 | Medium Complexity | Analytics
Architecture flow: Data Sources → S3 Buckets → Glue Crawler → Glue Data Catalog → Athena → Analytics Tools

Description

This pattern extends the basic data collection with cataloging and SQL query capabilities. It enables analysts and business users to explore data using familiar SQL syntax without moving data out of the data lake.

Key Components

  • Amazon S3: Optimized storage with partitioning
  • AWS Glue Crawler: Automatic discovery of data schema
  • AWS Glue Data Catalog: Metadata repository
  • Amazon Athena: SQL query engine

Applicable Use Cases

  • Fraud Detection in Auto Insurance Claims
  • Customer Segmentation by Risk Profile
  • Intelligent Document Processing
  • Contact Center Analytics

Implementation Considerations

  • Convert data to columnar formats like Parquet or ORC for better query performance
  • Implement partitioning strategies based on common query patterns (see the query sketch after this list)
  • Define appropriate table properties and partitioning schemes in the Glue Data Catalog
  • Establish a business glossary to make data more discoverable
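
To show how the partitioning consideration plays out at query time, here is a hedged boto3 sketch that runs an Athena query against a hypothetical partitioned Parquet table; the database, table, column names, and results bucket are all placeholders.

```python
import boto3

athena = boto3.client("athena")

# Placeholder database, table, and results location
DATABASE = "claims_analytics"
OUTPUT_LOCATION = "s3://example-athena-results/"

# Filtering on partition columns (year, month) lets Athena prune partitions,
# which reduces the data scanned and therefore the per-query cost.
QUERY = """
SELECT claim_id, loss_type, claim_amount
FROM claims_parquet
WHERE year = '2024' AND month = '06' AND claim_amount > 10000
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print("Started query:", response["QueryExecutionId"])
```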

AWS Services

  • Amazon S3 (Core)
  • AWS Glue Data Catalog (Core)
  • AWS Glue Crawler (Core)
  • Amazon Athena (Core)
  • AWS Lambda (Optional)
  • Amazon QuickSight (Optional)

Pros & Cons

Pros
  • Self-service SQL-based data access
  • No data movement required for analysis
  • Improved data discovery through metadata
  • Pay-per-query cost model
Cons
  • Limited complex ETL capabilities
  • Performance depends on data format and size
  • Costs can grow for large datasets
  • Limited support for real-time analytics

Pattern 3: Enterprise Data Model

Tier 3 | High Complexity | Governance
Architecture components: Data Sources, S3 Raw Zone, Glue ETL, S3 Curated Zone, Athena, Glue Data Catalog, Lake Formation, Redshift, QuickSight

Description

This pattern focuses on implementing a formal enterprise data model with strong governance controls. It provides a unified view of business data with clear lineage and access controls, enabling consistent analytics and reporting.

Key Components

  • Multi-zone S3 data lake: Raw, curated, and purpose-built zones
  • AWS Glue ETL: Complex data transformations
  • AWS Lake Formation: Fine-grained access control
  • Amazon Redshift: High-performance data warehouse for complex queries
  • Amazon QuickSight: Business intelligence

Applicable Use Cases

  • Claims Leakage Detection & Prevention
  • Predictive Modeling for Claims Severity
  • Risk Assessment for Life Insurance
  • ESG Risk Integration into Underwriting
  • Premium Leakage Detection

Implementation Considerations

  • Develop a formal enterprise data model with business-aligned definitions
  • Implement a data stewardship program with clear ownership
  • Design ETL processes with robust error handling and monitoring (a minimal Glue job sketch follows this list)
  • Implement data quality rules and validation processes
  • Create a business glossary and metadata management strategy
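
The Glue job body below is a minimal PySpark sketch of the raw-to-curated step, assuming a hypothetical catalog table (insurance_raw.raw_claims) and curated bucket; error handling and data quality checks are reduced to a single illustrative rule.

```python
# Sketch of a Glue ETL job body (PySpark); names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the raw zone via the Glue Data Catalog
raw = glue_context.create_dynamic_frame.from_catalog(
    database="insurance_raw", table_name="raw_claims"
)

# Example data quality rule: drop records without a claim identifier
clean = raw.toDF().dropna(subset=["claim_id"])

# Write curated, partitioned Parquet to the curated zone
(clean.write.mode("overwrite")
      .partitionBy("year", "month")
      .parquet("s3://example-curated-zone/claims/"))

job.commit()
```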

AWS Services

  • Amazon S3, multi-zone (Core)
  • AWS Glue ETL (Core)
  • AWS Lake Formation (Core)
  • Amazon Redshift (Core)
  • Amazon Athena (Core)
  • Amazon QuickSight (Optional)
  • AWS Step Functions (Optional)

Pros & Cons

Pros
  • Business-aligned data definitions
  • Strong governance and access controls
  • Improved query performance for complex analytics
  • Clear data lineage and quality controls
  • Support for sophisticated use cases
Cons
  • Higher implementation complexity
  • Requires specialized data modeling expertise
  • Higher operational costs
  • Potential bottlenecks in data stewardship processes
  • Change management challenges

Pattern 4: Enterprise Data Warehouse & Advanced Analytics

Tier 4 | High Complexity | Real-time
Architecture components: Streaming and Batch Sources, Kinesis, S3 Raw Zone, Glue ETL/ELT, S3 Curated Zone, Lake Formation, Redshift, SageMaker, QuickSight, Applications, Monitoring & Governance

Description

This comprehensive pattern provides a complete enterprise data platform with advanced analytics capabilities, real-time processing, and robust governance. It supports the most sophisticated use cases and provides a semantic layer for business users.

Key Components

  • Amazon Kinesis: Real-time data streaming
  • Multi-zone data lake with AWS Lake Formation: Enhanced governance
  • AWS Glue ETL/ELT: Advanced data transformation with orchestration
  • Amazon Redshift Serverless/Provisioned: High-performance analytics
  • Amazon SageMaker: Machine learning capabilities
  • Amazon QuickSight: Self-service BI with semantic layer
  • Monitoring & governance: Comprehensive observability

Applicable Use Cases

  • Real-Time Natural Disaster Monitoring
  • Dynamic Pricing Based on Real-Time Data
  • Digital Twin for Risk Simulation
  • Automated Reinsurance Optimization
  • Scenario-Based Capital Modeling

Implementation Considerations

  • Implement data domains aligned with business functions
  • Design for both batch and real-time data processing
  • Develop comprehensive data quality monitoring and alerting
  • Implement advanced access controls and data security measures (see the Lake Formation sketch after this list)
  • Create a business-friendly semantic layer for self-service analytics
  • Design for cost optimization with appropriate usage monitoring
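
As one sketch of the access-control consideration, the boto3 call below grants column-level SELECT through Lake Formation to a hypothetical analyst role; the account ID, role, database, table, and column names are all placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Restrict a hypothetical analyst persona to non-sensitive columns only
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "insurance_curated",
            "Name": "policies",
            "ColumnNames": ["policy_id", "product_line", "premium_amount"],
        }
    },
    Permissions=["SELECT"],
)
```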

AWS Services

  • Amazon Kinesis (Core)
  • Amazon S3, multi-zone (Core)
  • AWS Glue ETL (Core)
  • AWS Lake Formation (Core)
  • Amazon Redshift Serverless (Core)
  • Amazon SageMaker (Core)
  • Amazon QuickSight (Core)
  • AWS Step Functions (Optional)
  • Amazon CloudWatch (Optional)
  • AWS Lambda (Optional)

Pros & Cons

Pros
  • Comprehensive data platform capabilities
  • Support for real-time and batch processing
  • Advanced analytics and machine learning integration
  • Self-service business intelligence
  • Enterprise-grade governance and security
  • Supports most demanding use cases
Cons
  • Significant implementation complexity
  • High operational and infrastructure costs
  • Requires specialized skills across multiple technologies
  • Longer implementation timeline
  • Potential over-engineering for simple use cases

Pattern Comparison

Criteria | Pattern 1: Basic Data Collection | Pattern 2: Data Lake with SQL Query | Pattern 3: Enterprise Data Model | Pattern 4: Enterprise DWH & Advanced Analytics
Data Platform Tier | Tier 1 | Tier 2 | Tier 3 | Tier 4
Implementation Complexity | Low | Medium | High | Very High
Implementation Timeframe | 2-4 weeks | 1-3 months | 3-6 months | 6-12 months
Cost Range | $ | $$ | $$$ | $$$$
Team Expertise Required | Basic AWS knowledge | AWS, SQL, basic data modeling | Enterprise data modeling, ETL, governance | Advanced analytics, ML, data engineering, architecture
Self-Service Capabilities | Limited | Moderate (SQL) | Good (SQL + reporting) | Excellent (semantic layer + visualizations)
Real-Time Capabilities | None | Limited | Moderate | Comprehensive
Governance & Security | Basic | Improved | Advanced | Enterprise-grade

Add-on: Data Sourcing Processes

This section compares different approaches to sourcing data from operational systems. The choice between CDC (Change Data Capture) and batch-based approaches has significant implications for data freshness, system impact, and implementation complexity.

CDC vs Batch Import Methods

Cross-Tier | Data Integration | ETL/ELT

Change Data Capture (CDC): Source Database → CDC Tool (e.g., DMS) → Target Storage → Consumption (log-based; real-time/near real-time)
Batch Import: Source Database → ETL/ELT Process → Target Storage → Consumption (full/incremental loads; scheduled intervals)

Change Data Capture (CDC)

CDC captures and tracks data changes in real-time or near real-time, typically by reading database transaction logs or using database triggers. AWS Database Migration Service (DMS) is a common implementation of CDC.

Key Characteristics:
  • Real-time or near real-time: Changes are captured and propagated with minimal delay
  • Low impact: Log-based CDC has minimal impact on source systems
  • Continuous operation: Runs continuously rather than on a schedule
  • Captures all changes: Records inserts, updates, and deletes
  • Transaction consistency: Preserves transaction boundaries for data consistency

Batch Import

Batch import processes extract data from source systems at scheduled intervals, typically using full extracts or incremental loads based on timestamps or other tracking mechanisms.

Key Characteristics:
  • Scheduled execution: Runs at defined intervals (hourly, daily, weekly)
  • Higher resource usage: Can create significant load on source systems during extraction
  • Point-in-time consistency: Data reflects the state at extraction time only
  • Simpler implementation: Generally requires less specialized infrastructure
  • Data staleness: Data freshness depends on batch frequency

Implementation Considerations

Factor | CDC (e.g., AWS DMS) | Batch Import
Data Freshness | Real-time or near real-time (seconds to minutes) | Delayed (hours to days)
Source System Impact | Low (log-based), Medium (trigger-based) | High (especially for full extracts)
Implementation Complexity | Medium to High | Low to Medium
Best Use Cases | Real-time analytics, operational reporting, event-driven architectures | Periodic reporting, historical analysis, large data volumes with less time sensitivity
AWS Services | AWS DMS, Kinesis, MSK (Kafka) | Glue ETL, Lambda, Step Functions

AWS DMS Implementation

Components
  • Replication Instance (Core)
  • Source Endpoint (Core)
  • Target Endpoint (Core)
  • Replication Task (Core)
  • CloudWatch Monitoring (Optional)
Setup Process
  1. Create replication instance with appropriate size
  2. Configure source database endpoint with proper permissions
  3. Configure target endpoint (S3, Redshift, etc.)
  4. Create replication task defining tables & transformation rules
  5. Monitor with CloudWatch metrics
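
A hedged boto3 sketch of step 4 follows; it assumes the replication instance and endpoints from steps 1-3 already exist, and the ARNs, schema, and table mapping are placeholders.

```python
import json

import boto3

dms = boto3.client("dms")

# Table mappings select which source tables to replicate (placeholder schema)
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-claims",
            "object-locator": {"schema-name": "claims", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="claims-cdc-to-s3",
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",    # created in step 2
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",    # created in step 3
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",  # created in step 1
    MigrationType="full-load-and-cdc",  # initial load, then ongoing CDC
    TableMappings=json.dumps(table_mappings),
)
```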

When to Use Which?

Choose CDC when you need:
  • Real-time or near real-time data updates
  • Minimal impact on source systems
  • Complete change history including intermediate states
  • Event-driven architectures with data change events
  • Support for microservices and event sourcing patterns
Choose Batch Import when:
  • A simpler implementation with fewer moving parts is preferred
  • Less frequent data refreshes are acceptable
  • Heavy transformation is needed during the loading process
  • Lower infrastructure costs are a priority
  • Very large data volumes must be handled efficiently

Hybrid Approach

Many modern architectures combine both approaches:

  • CDC for critical data that requires real-time processing
  • Batch for historical or less time-sensitive data to optimize costs
  • CDC for raw data collection followed by batch processing for aggregations
  • Batch as a fallback mechanism for CDC failures or reconciliation

Add-on: Real-Time Data Ingestion Patterns

This section explores different patterns for real-time data ingestion and streaming, focusing on AWS services such as Amazon Kinesis. These patterns are crucial for use cases that require immediate data processing, such as real-time analytics, monitoring, and event-driven architectures.

Real-Time Streaming Ingestion Options

Real-Time | Streaming | Event Processing

Sources: IoT Devices, Application Logs, Clickstream, Financial Data
Ingestion: Kinesis Data Streams (high-throughput, real-time streaming; custom retention from 24 hours to 365 days), Kinesis Data Firehose (loads streams into destinations with transformations and batching), Amazon MSK (fully managed Kafka; high durability and throughput)
Processing: AWS Lambda (serverless, event-driven functions), Kinesis Data Analytics (SQL/Flink real-time analytics), EC2/ECS/EKS (custom processing with any framework, e.g., Spark)
Destinations: S3 Data Lake, DynamoDB, Redshift, OpenSearch

Key Streaming Ingestion Services

Amazon Kinesis Data Streams

Key Features:
  • Real-time processing with sub-second latency
  • Durable storage with custom retention
  • Ordered record delivery within shards
  • Multiple consumers per stream
Best For:
  • Real-time analytics
  • Log and event data collection
  • Mobile data capture
  • IoT device telemetry
Limitations:
  • Shard management overhead
  • Throughput limitations per shard
  • More complex than Firehose

Amazon Kinesis Data Firehose

Key Features:
  • Fully managed, no administration
  • Auto-scaling to match throughput
  • Lambda transformations
  • Batch writing to destinations
Best For:
  • ETL data pipelines
  • Log delivery to storage
  • IoT data ingestion
  • Clickstream analysis
Limitations:
  • Minimum 60-second buffer
  • No custom consumer applications
  • Limited destination options

Amazon MSK (Managed Kafka)

Key Features:
  • Fully managed Apache Kafka
  • High durability and availability
  • Unlimited retention
  • Kafka compatibility
Best For:
  • Enterprise event streaming
  • Stream processing pipelines
  • Pub/sub messaging
  • Log aggregation
Limitations:
  • Higher cost than Kinesis
  • More complex configuration
  • Requires Kafka expertise

Common Data Ingestion Patterns

Pattern 1: Direct Ingestion with Processing

Design: Data sources → Kinesis Data Streams → Lambda/Kinesis Data Analytics (KDA) → Destination

Use Case: Real-time analytics, anomaly detection, alerting

Benefits: Low latency, immediate processing, flexible consumers

Implementation:

  1. Set up Kinesis Data Streams with appropriate shard count
  2. Configure producers with proper partition keys
  3. Implement Lambda for event-by-event processing
  4. Use KDA for windowed aggregations and joins
  5. Output to destinations or notifications
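
The snippet below sketches the producer side of this pattern with boto3, assuming a hypothetical claims-events stream; the partition key choice (policy_id) and event shape are illustrative only.

```python
import json

import boto3

kinesis = boto3.client("kinesis")


def publish_event(event: dict) -> None:
    """Put a single event onto a hypothetical stream.

    The partition key determines shard placement, so pick a field that
    spreads records evenly across shards.
    """
    kinesis.put_record(
        StreamName="claims-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["policy_id"]),
    )


publish_event({"policy_id": "POL-1001", "event_type": "claim_opened", "amount": 2500})
```
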
Pattern 2: ETL Streaming to Data Lake

Design: Data sources → Kinesis Firehose → [Transform] → S3/Redshift

Use Case: Log analytics, clickstream analysis, data warehousing

Benefits: Fully managed, minimal maintenance, cost-effective

Implementation:

  1. Configure Kinesis Firehose delivery stream
  2. Set up optional Lambda transformation
  3. Configure destination settings (S3 prefix, partitioning)
  4. Define buffer conditions and format conversion
  5. Enable error logging and monitoring
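
A minimal boto3 sketch of steps 1, 3, and 4 follows; the stream, role, and bucket names are placeholders, and the buffering values are examples rather than recommendations.

```python
import boto3

firehose = boto3.client("firehose")

# Buffering hints trade latency for fewer, larger S3 objects
# (the delivery interval has a 60-second minimum).
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::example-raw-zone",
        "Prefix": "clickstream/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)
```
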
Pattern 3: Fan-out Processing

Design: Data sources → Kinesis/MSK → Multiple consumers → Various destinations

Use Case: Multi-purpose data processing, different use cases from same stream

Benefits: Reuse data for different purposes, parallel processing

Implementation:

  1. Create Kinesis Data Stream or MSK cluster
  2. Implement enhanced fan-out consumers for high throughput
  3. Deploy separate processing applications for each use case
  4. Manage consumer checkpointing for fault tolerance
  5. Monitor each consumer application independently
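
The sketch below registers enhanced fan-out consumers with boto3 for a hypothetical stream; consumer names are placeholders. Each registered consumer receives dedicated read throughput per shard rather than sharing the default limit.

```python
import boto3

kinesis = boto3.client("kinesis")

# Placeholder stream ARN and consumer names for three downstream use cases
stream_arn = "arn:aws:kinesis:eu-west-1:123456789012:stream/claims-events"

for name in ["fraud-scoring", "realtime-dashboard", "audit-archiver"]:
    # Each enhanced fan-out consumer can then read via SubscribeToShard
    kinesis.register_stream_consumer(StreamARN=stream_arn, ConsumerName=name)
```
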
Pattern 4: Lambda Event Source Mapping

Design: Data sources → Kinesis → Lambda (event source) → DynamoDB/S3

Use Case: Serverless stream processing, real-time data transformations

Benefits: Fully serverless, auto-scaling, simplified architecture

Implementation:

  1. Create Kinesis Data Stream
  2. Implement Lambda function for processing
  3. Configure event source mapping with batch size
  4. Set up error handling and retries
  5. Implement state management if needed
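
A minimal Lambda handler for this pattern might look like the following; the DynamoDB table name and record fields are assumptions, and error handling is reduced to letting exceptions trigger the event source mapping's built-in retry.

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("claim-events")  # placeholder table name


def handler(event, context):
    """Handler invoked by a Kinesis event source mapping."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "claim_id": payload["claim_id"],
            "sequence": record["kinesis"]["sequenceNumber"],
            "body": json.dumps(payload),
        })
    # An unhandled exception here causes the batch to be retried
    return {"processed": len(event["Records"])}
```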

Service Selection Guide

Choose the appropriate ingestion service based on your requirements:

Use Kinesis Data Streams when:
  • You need sub-second processing latency
  • Custom processing logic is required
  • Multiple applications need to consume the same data
  • You need to process records in their exact order
  • You need replay capabilities (longer retention)
Use Kinesis Data Firehose when:
  • You need a fully managed, zero-admin solution
  • Your destination is S3, Redshift, OpenSearch, or 3rd party
  • Near real-time latency (1-5 minutes) is acceptable
  • Simple data transformations are sufficient
  • You want a simpler, more cost-effective solution
Use Amazon MSK when:
  • You have existing Kafka applications
  • You need longer retention than Kinesis offers
  • Higher throughput is required
  • You need to maintain Kafka compatibility
  • You need complex stream processing like Kafka Streams

Kinesis Processing Options

Kinesis Data Analytics (KDA)
  • SQL Option: For simple analytics using SQL queries
  • Flink Applications: For complex event processing
  • Use Cases: Time-series analytics, windowed aggregations, streaming joins
  • Advantages: Fully managed, scalable, integrated
AWS Lambda
  • Event Processing: Batch or per-record processing
  • Use Cases: Simple transformations, filtering, enrichment
  • Advantages: Serverless, automatic scaling, low maintenance
Custom Applications (EC2/ECS/EKS)
  • KCL Applications: Managed shard processing with the Kinesis Client Library
  • Custom Frameworks: Spark Streaming, Flink self-managed
  • Use Cases: Complex processing, custom frameworks, ML inference
  • Advantages: Maximum flexibility, control over resources

Best Practices

  • Proper Sizing: Size Kinesis shards based on throughput needs
  • Partitioning Strategy: Choose partition keys to distribute data evenly
  • Retry Logic: Implement robust retry mechanisms for producer and consumer failures
  • Monitoring: Set CloudWatch alarms for shard utilization, throttling, and delivery delays (see the sketch after this list)
  • Error Handling: Implement dead-letter queues for processing failures
  • Cost Optimization: Use on-demand mode for variable workloads
  • Security: Implement encryption, VPC endpoints, and fine-grained IAM policies
  • Scaling: Implement auto-scaling for producers and consumers based on metrics
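
As one concrete example of the monitoring practice above, the boto3 sketch below creates a CloudWatch alarm on producer throttling for a hypothetical stream; the stream name, SNS topic, and threshold are placeholders to adapt to your environment.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on producer write throttling for a placeholder stream
cloudwatch.put_metric_alarm(
    AlarmName="claims-events-write-throttling",
    Namespace="AWS/Kinesis",
    MetricName="WriteProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": "claims-events"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-platform-alerts"],
)
```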