warehouse13/ARCHITECTURE.md
2025-10-14 15:37:37 -05:00

Architecture Overview

System Design

The Test Artifact Data Lake is a cloud-native, microservices-ready application that keeps metadata storage (PostgreSQL) separate from blob storage (S3 or MinIO).

Components

1. FastAPI Application (app/)

Purpose: RESTful API server handling all client requests

Key Modules:

  • app/main.py: Application entry point, route registration
  • app/config.py: Configuration management using Pydantic
  • app/database.py: Database connection and session management

2. API Layer (app/api/)

Purpose: HTTP endpoint definitions and request handling

Files:

  • app/api/artifacts.py: All artifact-related endpoints
    • Upload: Multipart file upload with metadata
    • Download: File retrieval with streaming
    • Query: Complex filtering and search
    • Delete: Cascade deletion from both DB and storage
    • Presigned URLs: Temporary download links

3. Models Layer (app/models/)

Purpose: SQLAlchemy ORM models for database tables

Files:

  • app/models/artifact.py: Artifact model with all metadata fields
    • File information (name, type, size, path)
    • Test metadata (name, suite, config, result)
    • Custom metadata and tags
    • Versioning support
    • Timestamps

4. Schemas Layer (app/schemas/)

Purpose: Pydantic models for request/response validation

Files:

  • app/schemas/artifact.py:
    • ArtifactCreate: Upload request validation
    • ArtifactResponse: API response serialization
    • ArtifactQuery: Query filtering parameters

5. Storage Layer (app/storage/)

Purpose: Abstraction over different blob storage backends

Architecture:

StorageBackend (Abstract Base Class)
    ├── S3Backend (AWS S3 implementation)
    └── MinIOBackend (Self-hosted S3-compatible)

Files:

  • app/storage/base.py: Abstract interface
  • app/storage/s3_backend.py: AWS S3 implementation
  • app/storage/minio_backend.py: MinIO implementation
  • app/storage/factory.py: Backend selection logic

Key Methods:

  • upload_file(): Store blob with unique path
  • download_file(): Retrieve blob by path
  • delete_file(): Remove blob from storage
  • file_exists(): Check blob existence
  • get_file_url(): Generate presigned download URL
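
The interface above could look roughly like the following sketch. Only the five method names come from this document; the signatures, the `InMemoryBackend` class, and the `get_storage_backend` registry are assumptions made for illustration and are not the contents of app/storage/.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Interface sketch: method names from the list above, signatures assumed."""

    @abstractmethod
    def upload_file(self, data: bytes, path: str) -> str: ...

    @abstractmethod
    def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, path: str) -> bool: ...

    @abstractmethod
    def file_exists(self, path: str) -> bool: ...

    @abstractmethod
    def get_file_url(self, path: str, expiration: int = 3600) -> str: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical backend for unit tests; not part of the project."""

    def __init__(self) -> None:
        self._blobs = {}  # maps storage path -> blob bytes

    def upload_file(self, data: bytes, path: str) -> str:
        self._blobs[path] = data
        return path

    def download_file(self, path: str) -> bytes:
        return self._blobs[path]  # raises KeyError if the blob is missing

    def delete_file(self, path: str) -> bool:
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path: str) -> bool:
        return path in self._blobs

    def get_file_url(self, path: str, expiration: int = 3600) -> str:
        # Placeholder only; a real backend would return a signed URL.
        return f"memory://{path}?expires_in={expiration}"


# Hypothetical registry standing in for the selection logic in factory.py.
_BACKENDS = {"memory": InMemoryBackend}


def get_storage_backend(name: str) -> StorageBackend:
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unsupported storage backend: {name!r}") from None
```

Because callers depend only on `StorageBackend`, an in-memory implementation like this one can replace S3 or MinIO in tests without touching the API layer.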

Data Flow

Upload Flow

Client
  ↓ (multipart/form-data)
FastAPI Endpoint
  ↓ (parse metadata)
Validation Layer
  ↓ (generate UUID path)
Storage Backend
  ↓ (store blob)
Database
  ↓ (save metadata)
Response (artifact object)
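
The upload steps can be sketched as follows. This is a minimal illustration, not the project's endpoint code: `build_storage_path`, `upload_artifact`, the `blobs` dict (standing in for the storage backend), and the `save_row` callable (standing in for the database session) are all hypothetical names. Removing the blob when the metadata write fails is an assumed policy that the document does not state.

```python
import uuid
from pathlib import PurePosixPath
from typing import Callable, Dict


def build_storage_path(filename: str) -> str:
    """Return a UUID-based object name that keeps the original extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    return f"{uuid.uuid4()}{suffix}"


def upload_artifact(
    filename: str,
    data: bytes,
    blobs: Dict[str, bytes],
    save_row: Callable[[dict], None],
) -> dict:
    """Store the blob first, then record its metadata."""
    path = build_storage_path(filename)
    blobs[path] = data  # Storage Backend: store blob
    row = {
        "filename": filename,  # the original name is kept only in metadata
        "file_size": len(data),
        "storage_path": path,
    }
    try:
        save_row(row)  # Database: save metadata
    except Exception:
        blobs.pop(path, None)  # assumed cleanup so no orphaned blob remains
        raise
    return row
```

Writing the blob before the metadata row means a failed upload never leaves a database record that points at a missing file.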

Query Flow

Client
  ↓ (JSON query)
FastAPI Endpoint
  ↓ (validate filters)
Database Query Builder
  ↓ (SQL with filters)
PostgreSQL
  ↓ (result set)
Response (artifact list)
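
One way the query-builder step might turn validated filters into parameterized SQL is sketched below. The function name `build_artifact_query`, the `FILTERABLE` whitelist, and the default limit are assumptions; the project itself builds queries through SQLAlchemy, so this hand-written SQL only illustrates the idea.

```python
from typing import Any, Dict, Tuple

# Columns this sketch accepts equality filters on (indexed columns from the schema).
FILTERABLE = ("filename", "file_type", "test_name", "test_suite", "test_result")


def build_artifact_query(
    filters: Dict[str, Any], limit: int = 100
) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params) using named placeholders; unknown keys are ignored."""
    clauses = []
    params: Dict[str, Any] = {"limit": limit}
    for column in FILTERABLE:
        value = filters.get(column)
        if value is not None:
            clauses.append(f"{column} = :{column}")
            params[column] = value
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = f"SELECT * FROM artifacts{where} ORDER BY created_at DESC LIMIT :limit"
    return sql, params


sql, params = build_artifact_query({"test_result": "fail", "file_type": "pcap"})
print(sql)
print(params)
```

Column names come only from the fixed whitelist and values travel as bound parameters, so client input never gets interpolated into the SQL text.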

Download Flow

Client
  ↓ (GET request)
FastAPI Endpoint
  ↓ (lookup artifact)
Database
  ↓ (get storage path)
Storage Backend
  ↓ (retrieve blob)
StreamingResponse
  ↓ (binary data)
Client
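
The streaming step can be illustrated with a chunked generator. `iter_blob` and the 64 KiB chunk size are assumptions for this sketch; FastAPI's `StreamingResponse` accepts an iterator of bytes such as this one, though the project's actual download code may differ.

```python
import io
from typing import BinaryIO, Iterator


def iter_blob(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the blob in fixed-size chunks so the whole file never sits in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# In-memory stand-in for the file-like object a storage backend would return.
chunks = list(iter_blob(io.BytesIO(b"a" * 150_000)))
print([len(c) for c in chunks])  # [65536, 65536, 18928]
```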

Database Schema

Table: artifacts

| Column | Type | Description |
| --- | --- | --- |
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | csv, json, binary, pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | pass/fail/skip/error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |

Indexes:

  • Primary: id
  • Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id
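
To show what the secondary indexes buy, the sketch below creates a reduced version of the table in an in-memory SQLite database and asks the planner how it would run a filter on test_name. SQLite is a stand-in chosen so the example runs without a server; the production database is PostgreSQL and the real table is defined through SQLAlchemy. The index names and the NOT NULL constraints are assumptions.

```python
import sqlite3

# Reduced DDL: a subset of the columns listed above, with assumed constraints.
DDL = """
CREATE TABLE artifacts (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    filename     VARCHAR(500)  NOT NULL,
    file_type    VARCHAR(50)   NOT NULL,
    file_size    BIGINT        NOT NULL,
    storage_path VARCHAR(1000) NOT NULL,
    test_name    VARCHAR(500),
    test_result  VARCHAR(50),
    tags         JSON,
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    parent_id    INTEGER REFERENCES artifacts(id)
);
CREATE INDEX idx_artifacts_test_name   ON artifacts (test_name);
CREATE INDEX idx_artifacts_test_result ON artifacts (test_result);
CREATE INDEX idx_artifacts_created_at  ON artifacts (created_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Ask the planner how it would execute an equality filter on an indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM artifacts WHERE test_name = ?", ("smoke",)
).fetchall()
uses_index = any("idx_artifacts_test_name" in row[-1] for row in plan)
print(uses_index)
```

One detail to watch when declaring the model with SQLAlchemy's declarative base: `metadata` is a reserved attribute name there, so the Python attribute for the metadata column needs a different name that is mapped onto the column.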

Storage Architecture

Blob Storage

S3/MinIO Bucket Structure:

test-artifacts/
  ├── {uuid1}.csv
  ├── {uuid2}.json
  ├── {uuid3}.pcap
  └── {uuid4}.bin

  • Files stored with UUID-based names to prevent conflicts
  • Original filenames preserved in database metadata
  • No directory structure (flat namespace)

Database vs Blob Storage

| Data Type | Storage |
| --- | --- |
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |

Scalability Considerations

Horizontal Scaling

API Layer:

  • Stateless FastAPI instances
  • Can scale to N replicas
  • Load balanced via Kubernetes Service

Database:

  • PostgreSQL with read replicas
  • Connection pooling
  • Query optimization via indexes

Storage:

  • S3: Managed service with effectively unlimited capacity
  • MinIO: Scales out in distributed (multi-node) mode

Performance Optimizations

  1. Streaming Uploads/Downloads: Avoids loading entire files into memory
  2. Database Indexes: Fast queries on common fields
  3. Presigned URLs: Offload downloads to storage backend
  4. Async I/O: FastAPI async endpoints for concurrent requests

Security Architecture

Current State (No Auth)

  • API is open to all requests
  • Suitable only for trusted internal networks
  • Add authentication middleware before exposing the API more widely

Recommended Additions for Production

  1. Authentication:
    • OAuth 2.0 / OIDC
    • API keys
    • JWT tokens

  2. Authorization:
    • Role-based access control (RBAC)
    • Resource-level permissions

  3. Network Security:
    • TLS/HTTPS (via ingress)
    • Network policies (Kubernetes)
    • VPC isolation (AWS)

  4. Data Security:
    • Encryption at rest (S3 SSE)
    • Encryption in transit (HTTPS)
    • Secrets management (Kubernetes Secrets, AWS Secrets Manager)
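
As an illustration of the simplest option listed, API keys, the check below compares a presented key against stored hashes. Everything here is hypothetical: the application currently has no authentication, and `hash_key`, `VALID_KEY_HASHES`, and `is_authorized` are names invented for this sketch.

```python
import hashlib
import hmac
from typing import Optional


def hash_key(api_key: str) -> str:
    """Hash a key so the store never holds it in plaintext."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical key store; a real deployment would load hashes from a secret store.
VALID_KEY_HASHES = {hash_key("example-key-for-docs-only")}


def is_authorized(presented_key: Optional[str]) -> bool:
    """Return True only if the presented key matches a known hash."""
    if not presented_key:
        return False
    presented = hash_key(presented_key)
    # compare_digest runs in constant time, which avoids timing side channels.
    return any(hmac.compare_digest(presented, known) for known in VALID_KEY_HASHES)
```

A check like this would typically sit in middleware or a request dependency so that every endpoint is covered without per-route code.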

Deployment Architecture

Local Development

Docker Compose
  ├── PostgreSQL container
  ├── MinIO container
  └── API container

Kubernetes Production

Kubernetes Cluster
  ├── Deployment (API pods)
  ├── Service (load balancer)
  ├── StatefulSet (PostgreSQL)
  ├── StatefulSet (MinIO)
  ├── Ingress (HTTPS termination)
  └── Secrets (credentials)

AWS Production

AWS
  ├── EKS (API pods)
  ├── RDS PostgreSQL
  ├── S3 (blob storage)
  ├── ALB (load balancer)
  └── Secrets Manager

Configuration Management

Environment Variables

  • Centralized in app/config.py
  • Loaded via Pydantic Settings
  • Support for .env files
  • Override via environment variables
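
The precedence described above (defaults, then .env values, then real environment variables) can be sketched with the standard library. This is a stand-in for Pydantic Settings, not the contents of app/config.py; the field names, defaults, and the `parse_dotenv` and `load_settings` helpers are assumptions.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical fields and defaults, chosen only for illustration.
    database_url: str = "postgresql://localhost:5432/artifacts"
    storage_backend: str = "minio"
    bucket_name: str = "test-artifacts"


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip().upper()] = value.strip()
    return values


def load_settings(
    dotenv_text: str = "", env: Optional[Mapping[str, str]] = None
) -> Settings:
    """Merge sources so environment variables override .env values."""
    env = os.environ if env is None else env
    merged = {**parse_dotenv(dotenv_text), **{k.upper(): v for k, v in env.items()}}
    overrides = {
        f.name: merged[f.name.upper()]
        for f in fields(Settings)
        if f.name.upper() in merged
    }
    return Settings(**overrides)


settings = load_settings(
    "STORAGE_BACKEND=s3\nBUCKET_NAME=from-dotenv\n",
    env={"BUCKET_NAME": "from-env"},
)
print(settings.storage_backend, settings.bucket_name)  # s3 from-env
```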

Kubernetes ConfigMaps/Secrets

  • Non-sensitive: ConfigMaps
  • Sensitive: Secrets (base64-encoded, not encrypted by default)
  • Both injected into pods as environment variables
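
Base64 in a Secret manifest is an encoding, not a protection mechanism. The short sketch below, using a made-up value, shows that anyone who can read the manifest can recover the original bytes.

```python
import base64

# Made-up credential, encoded the way a Secret manifest's data field would hold it.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Decoding requires no key, so the value is only obscured, not protected.
decoded = base64.b64decode(encoded)
print(decoded)  # b's3cr3t-password'
```

This is why access to Secrets is usually restricted through RBAC, with encryption at rest enabled on the cluster.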

Monitoring and Observability

Health Checks

  • /health: Liveness probe
  • Database connectivity check
  • Storage backend connectivity check
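
A health endpoint that aggregates dependency checks might be structured as below. `run_health_checks` and the response shape are assumptions for illustration; the real /health handler may report different fields.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each check; a check signals failure by raising an exception."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}


def _storage_down() -> None:
    # Simulated failure standing in for an unreachable bucket.
    raise ConnectionError("bucket unreachable")


report = run_health_checks({"database": lambda: None, "storage": _storage_down})
print(report)
# {'status': 'unhealthy', 'checks': {'database': 'ok', 'storage': 'error: bucket unreachable'}}
```

Reporting each dependency separately makes it easier to tell a database outage from a storage outage when a probe fails.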

Logging

  • Structured logging via Python logging
  • JSON format for log aggregation
  • Log levels: INFO, WARNING, ERROR
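
JSON-formatted logs can be produced with the standard logging module alone, as in this sketch. The `JsonFormatter` class, the field names, and the `artifact_api` logger name are assumptions and are not taken from the project.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line for log aggregators."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("artifact_api")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("artifact uploaded: %s", "results.csv")
```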

Metrics (Future)

  • Prometheus metrics endpoint
  • Request count, latency, errors
  • Storage usage, database connections

Disaster Recovery

Backup Strategy

  1. Database: pg_dump scheduled backups
  2. Storage: S3 versioning, cross-region replication
  3. Configuration: GitOps (Helm charts in Git)

Recovery Procedures

  1. Restore database from backup
  2. Storage needs no restore step when using S3 (data remains available)
  3. Redeploy application via Helm

Future Enhancements

Performance

  • Caching layer (Redis)
  • CDN for frequently accessed files
  • Database sharding for massive scale

Features

  • File versioning UI
  • Batch upload API
  • Full-text search (Elasticsearch)
  • File preview generation
  • Webhooks for events

Operations

  • Automated testing pipeline
  • Blue-green deployments
  • Canary releases
  • Disaster recovery automation

Technology Choices Rationale

| Technology | Why? |
| --- | --- |
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry standard, scalable, S3-compatible |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |

Development Principles

  1. Separation of Concerns: Clear layers (API, models, storage)
  2. Abstraction: Storage backend abstraction for flexibility
  3. Configuration as Code: Helm charts, GitOps
  4. Testability: Dependency injection, mocking interfaces
  5. Observability: Logging, health checks, metrics
  6. Security: Secrets management, least privilege
  7. Scalability: Stateless design, horizontal scaling