Data Architecture Patterns

Common patterns for implementing data and analytics use cases

Reference Architecture Patterns

These patterns provide templates for implementing various data and analytics use cases based on the AWS architecture tiers. Each pattern addresses a specific set of requirements and can be customized to fit your needs.

Select a pattern that aligns with your use case's data platform tier requirement, business value, and complexity to accelerate implementation.

Pattern 1: Basic Data Collection

Tier 1 | Low Complexity | Batch Processing
Architecture flow: Data Sources → S3 Buckets → Applications

Description

This fundamental pattern focuses on collecting and storing data from various sources in a centralized S3 data lake. It provides a simple foundation for basic data access and serves as a starting point for more advanced patterns.

Key Components

  • Data Sources: Application databases, CSV files, JSON data, external APIs
  • Amazon S3: Central storage with basic folder structure
  • Basic IAM: Simple access controls for data producers and consumers

Applicable Use Cases

  • Automated First Notice of Loss
  • RPA for Claims Data Entry
  • Proactive Policy Renewal Communications
  • Personalized Policy Chatbots

Implementation Considerations

  • Define a consistent folder structure in S3 to organize data by source and date (see the sketch after this list)
  • Implement basic file validation to ensure data quality
  • Define appropriate IAM roles and policies for secure access
  • Document data dictionary and metadata
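
As a minimal illustration of the first two considerations, the following Python sketch (using boto3) validates a small batch of records and lands it under a source/dataset/date prefix. The bucket name, key layout, and `record_id` field are hypothetical placeholders, not a prescribed standard.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical layout: s3://<bucket>/<source>/<dataset>/year=YYYY/month=MM/day=DD/<file>
BUCKET = "example-insurance-data-lake"  # placeholder bucket name


def upload_record_batch(source: str, dataset: str, records: list) -> str:
    """Apply basic validation and land a batch of records under a date-based prefix."""
    # Minimal file validation: reject empty batches and records missing an id field
    if not records or any("record_id" not in r for r in records):
        raise ValueError("batch failed basic validation")

    now = datetime.now(timezone.utc)
    key = (
        f"{source}/{dataset}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{dataset}-{now:%H%M%S}.json"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return key


# Example: upload_record_batch("claims-app", "fnol-events", [{"record_id": "123"}])
```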

AWS Services

  • Amazon S3 (Core)
  • AWS IAM (Core)
  • AWS Lambda (Optional)
  • Amazon CloudWatch (Optional)

Pros & Cons

Pros
  • Simple and quick to implement
  • Low cost with minimal maintenance
  • Centralized storage for various data sources
  • Foundation for future data platform growth
Cons
  • Limited analytics capabilities
  • Minimal data governance
  • No built-in data catalog or discovery
  • Limited scalability for complex use cases

Pattern 2: Data Lake with SQL Query

Tier 2 | Medium Complexity | Analytics
Architecture flow: Data Sources → S3 Buckets → Glue Crawler → Glue Data Catalog → Athena → Analytics Tools

Description

This pattern extends the basic data collection with cataloging and SQL query capabilities. It enables analysts and business users to explore data using familiar SQL syntax without moving data out of the data lake.

Key Components

  • Amazon S3: Optimized storage with partitioning
  • AWS Glue Crawler: Automatic discovery of data schema
  • AWS Glue Data Catalog: Metadata repository
  • Amazon Athena: SQL query engine

Applicable Use Cases

  • Fraud Detection in Auto Insurance Claims
  • Customer Segmentation by Risk Profile
  • Intelligent Document Processing
  • Contact Center Analytics

Implementation Considerations

  • Convert data to columnar formats like Parquet or ORC for better query performance
  • Implement partitioning strategies based on common query patterns (see the query sketch after this list)
  • Define appropriate table properties and partitioning schemes in the Glue Data Catalog
  • Establish a business glossary to make data more discoverable
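
To show how the partitioning consideration plays out at query time, here is a hedged boto3 sketch that runs an Athena query against a hypothetical partitioned Parquet table; the database, table, column names, and results bucket are all placeholders.

```python
import boto3

athena = boto3.client("athena")

# Placeholder database, table, and results location
DATABASE = "claims_analytics"
OUTPUT_LOCATION = "s3://example-athena-results/"

# Filtering on partition columns (year, month) lets Athena prune partitions,
# which reduces the data scanned and therefore the per-query cost.
QUERY = """
SELECT claim_id, loss_type, claim_amount
FROM claims_parquet
WHERE year = '2024' AND month = '06' AND claim_amount > 10000
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print("Started query:", response["QueryExecutionId"])
```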

AWS Services

  • Amazon S3 (Core)
  • AWS Glue Data Catalog (Core)
  • AWS Glue Crawler (Core)
  • Amazon Athena (Core)
  • AWS Lambda (Optional)
  • Amazon QuickSight (Optional)

Pros & Cons

Pros
  • Self-service SQL-based data access
  • No data movement required for analysis
  • Improved data discovery through metadata
  • Pay-per-query cost model
Cons
  • Limited complex ETL capabilities
  • Performance depends on data format and size
  • Costs can grow for large datasets
  • Limited support for real-time analytics

Pattern 3: Enterprise Data Model

Tier 3 | High Complexity | Governance
Architecture components: Data Sources, S3 Raw Zone, Glue ETL, S3 Curated Zone, Athena, Glue Data Catalog, Lake Formation, Redshift, QuickSight

Description

This pattern focuses on implementing a formal enterprise data model with strong governance controls. It provides a unified view of business data with clear lineage and access controls, enabling consistent analytics and reporting.

Key Components

  • Multi-zone S3 data lake: Raw, curated, and purpose-built zones
  • AWS Glue ETL: Complex data transformations
  • AWS Lake Formation: Fine-grained access control
  • Amazon Redshift: High-performance data warehouse for complex queries
  • Amazon QuickSight: Business intelligence

Applicable Use Cases

  • Claims Leakage Detection & Prevention
  • Predictive Modeling for Claims Severity
  • Risk Assessment for Life Insurance
  • ESG Risk Integration into Underwriting
  • Premium Leakage Detection

Implementation Considerations

  • Develop a formal enterprise data model with business-aligned definitions
  • Implement a data stewardship program with clear ownership
  • Design ETL processes with robust error handling and monitoring (a minimal Glue job sketch follows this list)
  • Implement data quality rules and validation processes
  • Create a business glossary and metadata management strategy
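
The Glue job body below is a minimal PySpark sketch of the raw-to-curated step, assuming a hypothetical catalog table (insurance_raw.raw_claims) and curated bucket; error handling and data quality checks are reduced to a single illustrative rule.

```python
# Sketch of a Glue ETL job body (PySpark); names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the raw zone via the Glue Data Catalog
raw = glue_context.create_dynamic_frame.from_catalog(
    database="insurance_raw", table_name="raw_claims"
)

# Example data quality rule: drop records without a claim identifier
clean = raw.toDF().dropna(subset=["claim_id"])

# Write curated, partitioned Parquet to the curated zone
(clean.write.mode("overwrite")
      .partitionBy("year", "month")
      .parquet("s3://example-curated-zone/claims/"))

job.commit()
```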

AWS Services

  • Amazon S3, multi-zone (Core)
  • AWS Glue ETL (Core)
  • AWS Lake Formation (Core)
  • Amazon Redshift (Core)
  • Amazon Athena (Core)
  • Amazon QuickSight (Optional)
  • AWS Step Functions (Optional)

Pros & Cons

Pros
  • Business-aligned data definitions
  • Strong governance and access controls
  • Improved query performance for complex analytics
  • Clear data lineage and quality controls
  • Support for sophisticated use cases
Cons
  • Higher implementation complexity
  • Requires specialized data modeling expertise
  • Higher operational costs
  • Potential bottlenecks in data stewardship processes
  • Change management challenges

Pattern 4: Enterprise Data Warehouse & Advanced Analytics

Tier 4 | High Complexity | Real-time
Architecture components: Streaming and Batch Sources, Kinesis, S3 Raw Zone, Glue ETL/ELT, S3 Curated Zone, Lake Formation, Redshift, SageMaker, QuickSight, Applications, Monitoring & Governance

Description

This comprehensive pattern provides a complete enterprise data platform with advanced analytics capabilities, real-time processing, and robust governance. It supports the most sophisticated use cases and provides a semantic layer for business users.

Key Components

  • Amazon Kinesis: Real-time data streaming
  • Multi-zone data lake with AWS Lake Formation: Enhanced governance
  • AWS Glue ETL/ELT: Advanced data transformation with orchestration
  • Amazon Redshift Serverless/Provisioned: High-performance analytics
  • Amazon SageMaker: Machine learning capabilities
  • Amazon QuickSight: Self-service BI with semantic layer
  • Monitoring & governance: Comprehensive observability

Applicable Use Cases

  • Real-Time Natural Disaster Monitoring
  • Dynamic Pricing Based on Real-Time Data
  • Digital Twin for Risk Simulation
  • Automated Reinsurance Optimization
  • Scenario-Based Capital Modeling

Implementation Considerations

  • Implement data domains aligned with business functions
  • Design for both batch and real-time data processing
  • Develop comprehensive data quality monitoring and alerting
  • Implement advanced access controls and data security measures (see the Lake Formation sketch after this list)
  • Create a business-friendly semantic layer for self-service analytics
  • Design for cost optimization with appropriate usage monitoring
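
As one sketch of the access-control consideration, the boto3 call below grants column-level SELECT through Lake Formation to a hypothetical analyst role; the account ID, role, database, table, and column names are all placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Restrict a hypothetical analyst persona to non-sensitive columns only
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "insurance_curated",
            "Name": "policies",
            "ColumnNames": ["policy_id", "product_line", "premium_amount"],
        }
    },
    Permissions=["SELECT"],
)
```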

AWS Services

  • Amazon Kinesis (Core)
  • Amazon S3, multi-zone (Core)
  • AWS Glue ETL (Core)
  • AWS Lake Formation (Core)
  • Amazon Redshift Serverless (Core)
  • Amazon SageMaker (Core)
  • Amazon QuickSight (Core)
  • AWS Step Functions (Optional)
  • Amazon CloudWatch (Optional)
  • AWS Lambda (Optional)

Pros & Cons

Pros
  • Comprehensive data platform capabilities
  • Support for real-time and batch processing
  • Advanced analytics and machine learning integration
  • Self-service business intelligence
  • Enterprise-grade governance and security
  • Supports most demanding use cases
Cons
  • Significant implementation complexity
  • High operational and infrastructure costs
  • Requires specialized skills across multiple technologies
  • Longer implementation timeline
  • Potential over-engineering for simple use cases

Pattern Comparison

Criteria | Pattern 1: Basic Data Collection | Pattern 2: Data Lake with SQL Query | Pattern 3: Enterprise Data Model | Pattern 4: Enterprise DWH & Advanced Analytics
Data Platform Tier | Tier 1 | Tier 2 | Tier 3 | Tier 4
Implementation Complexity | Low | Medium | High | Very High
Implementation Timeframe | 2-4 weeks | 1-3 months | 3-6 months | 6-12 months
Cost Range | $ | $$ | $$$ | $$$$
Team Expertise Required | Basic AWS knowledge | AWS, SQL, basic data modeling | Enterprise data modeling, ETL, governance | Advanced analytics, ML, data engineering, architecture
Self-Service Capabilities | Limited | Moderate (SQL) | Good (SQL + reporting) | Excellent (semantic layer + visualizations)
Real-Time Capabilities | None | Limited | Moderate | Comprehensive
Governance & Security | Basic | Improved | Advanced | Enterprise-grade

Add-on: Data Sourcing Processes

This section compares different approaches to sourcing data from operational systems. The choice between CDC (Change Data Capture) and batch-based approaches has significant implications for data freshness, system impact, and implementation complexity.

CDC vs Batch Import Methods

Cross-Tier | Data Integration | ETL/ELT

Change Data Capture (CDC): Source Database → CDC Tool (e.g., DMS) → Target Storage → Consumption (log-based; real-time/near real-time)
Batch Import: Source Database → ETL/ELT Process → Target Storage → Consumption (full/incremental loads; scheduled intervals)

Change Data Capture (CDC)

CDC captures and tracks data changes in real-time or near real-time, typically by reading database transaction logs or using database triggers. AWS Database Migration Service (DMS) is a common implementation of CDC.

Key Characteristics:
  • Real-time or near real-time: Changes are captured and propagated with minimal delay
  • Low impact: Log-based CDC has minimal impact on source systems
  • Continuous operation: Runs continuously rather than on a schedule
  • Captures all changes: Records inserts, updates, and deletes
  • Transaction consistency: Preserves transaction boundaries for data consistency

Batch Import

Batch import processes extract data from source systems at scheduled intervals, typically using full extracts or incremental loads based on timestamps or other tracking mechanisms.

Key Characteristics:
  • Scheduled execution: Runs at defined intervals (hourly, daily, weekly)
  • Higher resource usage: Can create significant load on source systems during extraction
  • Point-in-time consistency: Data reflects the state at extraction time only
  • Simpler implementation: Generally requires less specialized infrastructure
  • Data staleness: Data freshness depends on batch frequency

Implementation Considerations

Factor | CDC (e.g., AWS DMS) | Batch Import
Data Freshness | Real-time or near real-time (seconds to minutes) | Delayed (hours to days)
Source System Impact | Low (log-based), Medium (trigger-based) | High (especially for full extracts)
Implementation Complexity | Medium to High | Low to Medium
Best Use Cases | Real-time analytics, operational reporting, event-driven architectures | Periodic reporting, historical analysis, large data volumes with less time sensitivity
AWS Services | AWS DMS, Kinesis, MSK (Kafka) | Glue ETL, Lambda, Step Functions

AWS DMS Implementation

Components
  • Replication Instance (Core)
  • Source Endpoint (Core)
  • Target Endpoint (Core)
  • Replication Task (Core)
  • CloudWatch Monitoring (Optional)
Setup Process
  1. Create replication instance with appropriate size
  2. Configure source database endpoint with proper permissions
  3. Configure target endpoint (S3, Redshift, etc.)
  4. Create replication task defining tables & transformation rules
  5. Monitor with CloudWatch metrics
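
A hedged boto3 sketch of step 4 follows; it assumes the replication instance and endpoints from steps 1-3 already exist, and the ARNs, schema, and table mapping are placeholders.

```python
import json

import boto3

dms = boto3.client("dms")

# Table mappings select which source tables to replicate (placeholder schema)
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-claims",
            "object-locator": {"schema-name": "claims", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="claims-cdc-to-s3",
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",    # created in step 2
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",    # created in step 3
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",  # created in step 1
    MigrationType="full-load-and-cdc",  # initial load, then ongoing CDC
    TableMappings=json.dumps(table_mappings),
)
```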

When to Use Which?

Choose CDC when you need:
  • Real-time or near real-time data updates
  • Minimal impact on source systems
  • Complete change history including intermediate states
  • Event-driven architectures with data change events
  • Support for microservices and event sourcing patterns
Choose Batch Import when:
  • A simpler implementation with fewer moving parts is preferred
  • Less frequent data refreshes are acceptable
  • Heavy transformation is needed during the loading process
  • Lower infrastructure costs are a priority
  • Very large data volumes must be handled efficiently

Hybrid Approach

Many modern architectures combine both approaches:

  • CDC for critical data that requires real-time processing
  • Batch for historical or less time-sensitive data to optimize costs
  • CDC for raw data collection followed by batch processing for aggregations
  • Batch as a fallback mechanism for CDC failures or reconciliation

Add-on: Real-Time Data Ingestion Patterns

This section explores different patterns for real-time data ingestion and streaming, focusing on AWS services such as Amazon Kinesis. These patterns are crucial for use cases that require immediate data processing, such as real-time analytics, monitoring, and event-driven architectures.

Real-Time Streaming Ingestion Options

Real-Time | Streaming | Event Processing

Sources: IoT Devices, Application Logs, Clickstream, Financial Data
Ingestion: Kinesis Data Streams (high-throughput, real-time streaming; custom retention from 24 hours to 365 days), Kinesis Data Firehose (loads streams into destinations with transformations and batching), Amazon MSK (fully managed Kafka; high durability and throughput)
Processing: AWS Lambda (serverless, event-driven functions), Kinesis Data Analytics (SQL/Flink real-time analytics), EC2/ECS/EKS (custom processing with any framework, e.g., Spark)
Destinations: S3 Data Lake, DynamoDB, Redshift, OpenSearch

Key Streaming Ingestion Services

Amazon Kinesis Data Streams

Key Features:
  • Real-time processing with sub-second latency
  • Durable storage with custom retention
  • Ordered record delivery within shards
  • Multiple consumers per stream
Best For:
  • Real-time analytics
  • Log and event data collection
  • Mobile data capture
  • IoT device telemetry
Limitations:
  • Shard management overhead
  • Throughput limitations per shard
  • More complex than Firehose

Amazon Kinesis Data Firehose

Key Features:
  • Fully managed, no administration
  • Auto-scaling to match throughput
  • Lambda transformations
  • Batch writing to destinations
Best For:
  • ETL data pipelines
  • Log delivery to storage
  • IoT data ingestion
  • Clickstream analysis
Limitations:
  • Minimum 60-second buffer
  • No custom consumer applications
  • Limited destination options

Amazon MSK (Managed Kafka)

Key Features:
  • Fully managed Apache Kafka
  • High durability and availability
  • Unlimited retention
  • Kafka compatibility
Best For:
  • Enterprise event streaming
  • Stream processing pipelines
  • Pub/sub messaging
  • Log aggregation
Limitations:
  • Higher cost than Kinesis
  • More complex configuration
  • Requires Kafka expertise

Common Data Ingestion Patterns

Pattern 1: Direct Ingestion with Processing

Design: Data sources → Kinesis Data Streams → Lambda/Kinesis Data Analytics (KDA) → Destination

Use Case: Real-time analytics, anomaly detection, alerting

Benefits: Low latency, immediate processing, flexible consumers

Implementation:

  1. Set up Kinesis Data Streams with appropriate shard count
  2. Configure producers with proper partition keys
  3. Implement Lambda for event-by-event processing
  4. Use KDA for windowed aggregations and joins
  5. Output to destinations or notifications
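
The snippet below sketches the producer side of this pattern with boto3, assuming a hypothetical claims-events stream; the partition key choice (policy_id) and event shape are illustrative only.

```python
import json

import boto3

kinesis = boto3.client("kinesis")


def publish_event(event: dict) -> None:
    """Put a single event onto a hypothetical stream.

    The partition key determines shard placement, so pick a field that
    spreads records evenly across shards.
    """
    kinesis.put_record(
        StreamName="claims-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["policy_id"]),
    )


publish_event({"policy_id": "POL-1001", "event_type": "claim_opened", "amount": 2500})
```
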
Pattern 2: ETL Streaming to Data Lake

Design: Data sources → Kinesis Firehose → [Transform] → S3/Redshift

Use Case: Log analytics, clickstream analysis, data warehousing

Benefits: Fully managed, minimal maintenance, cost-effective

Implementation:

  1. Configure Kinesis Firehose delivery stream
  2. Set up optional Lambda transformation
  3. Configure destination settings (S3 prefix, partitioning)
  4. Define buffer conditions and format conversion
  5. Enable error logging and monitoring
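
A minimal boto3 sketch of steps 1, 3, and 4 follows; the stream, role, and bucket names are placeholders, and the buffering values are examples rather than recommendations.

```python
import boto3

firehose = boto3.client("firehose")

# Buffering hints trade latency for fewer, larger S3 objects
# (the delivery interval has a 60-second minimum).
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::example-raw-zone",
        "Prefix": "clickstream/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)
```
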
Pattern 3: Fan-out Processing

Design: Data sources → Kinesis/MSK → Multiple consumers → Various destinations

Use Case: Multi-purpose data processing, different use cases from same stream

Benefits: Reuse data for different purposes, parallel processing

Implementation:

  1. Create Kinesis Data Stream or MSK cluster
  2. Implement enhanced fan-out consumers for high throughput
  3. Deploy separate processing applications for each use case
  4. Manage consumer checkpointing for fault tolerance
  5. Monitor each consumer application independently
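
The sketch below registers enhanced fan-out consumers with boto3 for a hypothetical stream; consumer names are placeholders. Each registered consumer receives dedicated read throughput per shard rather than sharing the default limit.

```python
import boto3

kinesis = boto3.client("kinesis")

# Placeholder stream ARN and consumer names for three downstream use cases
stream_arn = "arn:aws:kinesis:eu-west-1:123456789012:stream/claims-events"

for name in ["fraud-scoring", "realtime-dashboard", "audit-archiver"]:
    # Each enhanced fan-out consumer can then read via SubscribeToShard
    kinesis.register_stream_consumer(StreamARN=stream_arn, ConsumerName=name)
```
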
Pattern 4: Lambda Event Source Mapping

Design: Data sources → Kinesis → Lambda (event source) → DynamoDB/S3

Use Case: Serverless stream processing, real-time data transformations

Benefits: Fully serverless, auto-scaling, simplified architecture

Implementation:

  1. Create Kinesis Data Stream
  2. Implement Lambda function for processing
  3. Configure event source mapping with batch size
  4. Set up error handling and retries
  5. Implement state management if needed
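
A minimal Lambda handler for this pattern might look like the following; the DynamoDB table name and record fields are assumptions, and error handling is reduced to letting exceptions trigger the event source mapping's built-in retry.

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("claim-events")  # placeholder table name


def handler(event, context):
    """Handler invoked by a Kinesis event source mapping."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "claim_id": payload["claim_id"],
            "sequence": record["kinesis"]["sequenceNumber"],
            "body": json.dumps(payload),
        })
    # An unhandled exception here causes the batch to be retried
    return {"processed": len(event["Records"])}
```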

Service Selection Guide

Choose the appropriate ingestion service based on your requirements:

Use Kinesis Data Streams when:
  • You need sub-second processing latency
  • Custom processing logic is required
  • Multiple applications need to consume the same data
  • You need to process records in their exact order
  • You need replay capabilities (longer retention)
Use Kinesis Data Firehose when:
  • You need a fully managed, zero-admin solution
  • Your destination is S3, Redshift, OpenSearch, or 3rd party
  • Near real-time latency (1-5 minutes) is acceptable
  • Simple data transformations are sufficient
  • You want a simpler, more cost-effective solution
Use Amazon MSK when:
  • You have existing Kafka applications
  • You need longer retention than Kinesis offers
  • Higher throughput is required
  • You need to maintain Kafka compatibility
  • You need complex stream processing like Kafka Streams

Kinesis Processing Options

Kinesis Data Analytics (KDA)
  • SQL Option: For simple analytics using SQL queries
  • Flink Applications: For complex event processing
  • Use Cases: Time-series analytics, windowed aggregations, streaming joins
  • Advantages: Fully managed, scalable, integrated
AWS Lambda
  • Event Processing: Batch or per-record processing
  • Use Cases: Simple transformations, filtering, enrichment
  • Advantages: Serverless, automatic scaling, low maintenance
Custom Applications (EC2/ECS/EKS)
  • KCL Applications: Managed shard processing with the Kinesis Client Library
  • Custom Frameworks: Spark Streaming, Flink self-managed
  • Use Cases: Complex processing, custom frameworks, ML inference
  • Advantages: Maximum flexibility, control over resources

Best Practices

  • Proper Sizing: Size Kinesis shards based on throughput needs
  • Partitioning Strategy: Choose partition keys to distribute data evenly
  • Retry Logic: Implement robust retry mechanisms for producer and consumer failures
  • Monitoring: Set CloudWatch alarms for shard utilization, throttling, and delivery delays (see the sketch after this list)
  • Error Handling: Implement dead-letter queues for processing failures
  • Cost Optimization: Use on-demand mode for variable workloads
  • Security: Implement encryption, VPC endpoints, and fine-grained IAM policies
  • Scaling: Implement auto-scaling for producers and consumers based on metrics
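
As one concrete example of the monitoring practice above, the boto3 sketch below creates a CloudWatch alarm on producer throttling for a hypothetical stream; the stream name, SNS topic, and threshold are placeholders to adapt to your environment.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on producer write throttling for a placeholder stream
cloudwatch.put_metric_alarm(
    AlarmName="claims-events-write-throttling",
    Namespace="AWS/Kinesis",
    MetricName="WriteProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": "claims-events"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-platform-alerts"],
)
```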