Architecture Overview
System Design
The Test Artifact Data Lake is designed as a cloud-native, microservices-ready application that separates concerns between metadata storage and blob storage.
Components
1. FastAPI Application (app/)
Purpose: RESTful API server handling all client requests
Key Modules:
- app/main.py: Application entry point, route registration
- app/config.py: Configuration management using Pydantic
- app/database.py: Database connection and session management
2. API Layer (app/api/)
Purpose: HTTP endpoint definitions and request handling
Files:
- app/api/artifacts.py: All artifact-related endpoints
  - Upload: Multipart file upload with metadata
  - Download: File retrieval with streaming
  - Query: Complex filtering and search
  - Delete: Cascade deletion from both DB and storage
  - Presigned URLs: Temporary download links
3. Models Layer (app/models/)
Purpose: SQLAlchemy ORM models for database tables
Files:
- app/models/artifact.py: Artifact model with all metadata fields
  - File information (name, type, size, path)
  - Test metadata (name, suite, config, result)
  - Custom metadata and tags
  - Versioning support
  - Timestamps
4. Schemas Layer (app/schemas/)
Purpose: Pydantic models for request/response validation
Files:
- app/schemas/artifact.py:
  - ArtifactCreate: Upload request validation
  - ArtifactResponse: API response serialization
  - ArtifactQuery: Query filtering parameters
5. Storage Layer (app/storage/)
Purpose: Abstraction over different blob storage backends
Architecture:
StorageBackend (Abstract Base Class)
├── S3Backend (AWS S3 implementation)
└── MinIOBackend (Self-hosted S3-compatible)
Files:
- app/storage/base.py: Abstract interface
- app/storage/s3_backend.py: AWS S3 implementation
- app/storage/minio_backend.py: MinIO implementation
- app/storage/factory.py: Backend selection logic
Key Methods:
- upload_file(): Store blob with unique path
- download_file(): Retrieve blob by path
- delete_file(): Remove blob from storage
- file_exists(): Check blob existence
- get_file_url(): Generate presigned download URL
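The storage abstraction and backend selection described above can be sketched roughly as follows. The method names come from the list above, but the parameter lists, return types, the `InMemoryBackend` test double, the `STORAGE_BACKEND` variable name, and the `get_storage_backend` function are all assumptions made for illustration, not the project's actual code.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, Optional


class StorageBackend(ABC):
    """Abstract interface; signatures here are assumed, only the names are from the docs."""

    @abstractmethod
    def upload_file(self, path: str, data: bytes) -> str: ...

    @abstractmethod
    def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, path: str) -> bool: ...

    @abstractmethod
    def file_exists(self, path: str) -> bool: ...

    @abstractmethod
    def get_file_url(self, path: str, expiration: int = 3600) -> str: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical test double; stands in for S3Backend / MinIOBackend."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def upload_file(self, path: str, data: bytes) -> str:
        self._blobs[path] = data
        return path

    def download_file(self, path: str) -> bytes:
        return self._blobs[path]

    def delete_file(self, path: str) -> bool:
        # True if a blob was actually removed
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path: str) -> bool:
        return path in self._blobs

    def get_file_url(self, path: str, expiration: int = 3600) -> str:
        # A real backend would return a presigned URL here
        return f"memory://{path}?expires_in={expiration}"


# Registry used by the factory; real code would also register the S3 and MinIO classes.
_BACKENDS = {"memory": InMemoryBackend}


def get_storage_backend(name: Optional[str] = None) -> StorageBackend:
    """Pick a backend by name, falling back to an (assumed) environment variable."""
    name = name or os.environ.get("STORAGE_BACKEND", "memory")
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unsupported storage backend: {name!r}") from None
```

Because callers depend only on `StorageBackend`, a test double like this one can replace the S3 or MinIO implementation in unit tests without touching the API layer.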
Data Flow
Upload Flow
Client
↓ (multipart/form-data)
FastAPI Endpoint
↓ (parse metadata)
Validation Layer
↓ (generate UUID path)
Storage Backend
↓ (store blob)
Database
↓ (save metadata)
Response (artifact object)
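A minimal sketch of the upload steps above, using a plain dict in place of the storage backend and a callable in place of the database session. The function name `handle_upload`, its parameters, and the cleanup of the blob when the metadata write fails are assumptions for illustration; the source only specifies the order of the steps.

```python
import os
import uuid
from datetime import datetime, timezone

# Allowed values taken from the test_result column description
ALLOWED_RESULTS = {"pass", "fail", "skip", "error"}


def handle_upload(filename, data, metadata, blobs, save_metadata):
    """Validate, store the blob under a UUID path, then persist metadata.

    blobs: dict standing in for the storage backend (assumption).
    save_metadata: callable standing in for the database insert (assumption).
    """
    # Validation layer
    result = metadata.get("test_result")
    if result is not None and result not in ALLOWED_RESULTS:
        raise ValueError(f"Invalid test_result: {result!r}")

    # Generate a UUID-based path, keeping the original extension
    ext = os.path.splitext(filename)[1].lower()
    storage_path = f"{uuid.uuid4()}{ext}"

    # Store the blob first
    blobs[storage_path] = data

    # Then save metadata; on failure, remove the blob so it is not orphaned
    record = {
        "filename": filename,
        "file_size": len(data),
        "storage_path": storage_path,
        "test_result": result,
        "created_at": datetime.now(timezone.utc),
    }
    try:
        save_metadata(record)
    except Exception:
        blobs.pop(storage_path, None)
        raise
    return record
```

Writing the blob before the row means a database record never points at a missing object; the trade-off is that a failed metadata write needs a compensating delete, as sketched here.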
Query Flow
Client
↓ (JSON query)
FastAPI Endpoint
↓ (validate filters)
Database Query Builder
↓ (SQL with filters)
PostgreSQL
↓ (result set)
Response (artifact list)
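The "validate filters, then build SQL" steps might look something like the sketch below. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL and SQLAlchemy so that it runs anywhere; the `build_query` and `run_demo` names, the choice of filterable columns, and the sample rows are illustrative assumptions.

```python
import sqlite3

# Indexed string columns from the schema, treated here as the filterable set (assumption)
FILTERABLE = ("filename", "file_type", "test_name", "test_suite", "test_result")


def build_query(filters, limit=100, offset=0):
    """Turn a dict of equality filters into parameterized SQL."""
    unknown = set(filters) - set(FILTERABLE)
    if unknown:
        raise ValueError(f"Unsupported filters: {sorted(unknown)}")

    clauses, params = [], []
    for column in FILTERABLE:
        value = filters.get(column)
        if value is not None:
            # Column names come from the whitelist; values are bound as parameters
            clauses.append(f"{column} = ?")
            params.append(value)

    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (
        "SELECT id, filename, test_result FROM artifacts"
        f"{where} ORDER BY created_at DESC LIMIT ? OFFSET ?"
    )
    return sql, params + [limit, offset]


def run_demo():
    """Run one filtered query against an in-memory table with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artifacts (id INTEGER PRIMARY KEY, filename TEXT, "
        "file_type TEXT, test_name TEXT, test_suite TEXT, test_result TEXT, "
        "created_at TEXT)"
    )
    conn.execute("CREATE INDEX ix_artifacts_test_suite ON artifacts (test_suite)")
    conn.executemany(
        "INSERT INTO artifacts (filename, file_type, test_name, test_suite, "
        "test_result, created_at) VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("a.csv", "csv", "login", "smoke", "pass", "2024-01-01T00:00:00"),
            ("b.json", "json", "login", "smoke", "fail", "2024-01-02T00:00:00"),
            ("c.pcap", "pcap", "dns", "regression", "fail", "2024-01-03T00:00:00"),
        ],
    )
    sql, params = build_query({"test_suite": "smoke", "test_result": "fail"})
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows
```

Restricting filter keys to a fixed whitelist and binding values as parameters keeps user input out of the SQL text, which matters more here than the specific database driver.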
Download Flow
Client
↓ (GET request)
FastAPI Endpoint
↓ (lookup artifact)
Database
↓ (get storage path)
Storage Backend
↓ (retrieve blob)
StreamingResponse
↓ (binary data)
Client
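The "retrieve blob, then stream" steps rely on reading the object in fixed-size chunks rather than all at once. A minimal sketch of such a chunk generator is shown here; the name `iter_chunks` and the 64 KiB default are assumptions. A generator of this shape is the kind of iterable that FastAPI's `StreamingResponse` accepts as its content.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the stream's content in chunks of at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Empty read means end of stream
            break
        yield chunk


if __name__ == "__main__":
    # io.BytesIO stands in for the file-like body returned by a storage backend
    blob = io.BytesIO(b"a" * 150_000)
    print([len(c) for c in iter_chunks(blob)])
```

Memory use stays bounded by `chunk_size` regardless of the artifact's size, which is the point of streaming large pcap or binary files.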
Database Schema
Table: artifacts
| Column | Type | Description |
|---|---|---|
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | csv, json, binary, pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | pass/fail/skip/error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |
Indexes:
- Primary: id
- Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id
Storage Architecture
Blob Storage
S3/MinIO Bucket Structure:
test-artifacts/
├── {uuid1}.csv
├── {uuid2}.json
├── {uuid3}.pcap
└── {uuid4}.bin
- Files stored with UUID-based names to prevent conflicts
- Original filenames preserved in database metadata
- No directory structure (flat namespace)
Database vs Blob Storage
| Data Type | Storage |
|---|---|
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |
Scalability Considerations
Horizontal Scaling
API Layer:
- Stateless FastAPI instances
- Can scale to N replicas
- Load balanced via Kubernetes Service
Database:
- PostgreSQL with read replicas
- Connection pooling
- Query optimization via indexes
Storage:
- S3: Effectively unlimited capacity, scaled by AWS
- MinIO: Can be clustered
Performance Optimizations
- Streaming Uploads/Downloads: Avoids loading entire files into memory
- Database Indexes: Fast queries on common fields
- Presigned URLs: Offload downloads to storage backend
- Async I/O: FastAPI async endpoints for concurrent requests
Security Architecture
Current State (No Auth)
- API is open to all requests
- Suitable for internal networks
- Add authentication middleware as needed
Recommended Enhancements
- Authentication:
  - OAuth 2.0 / OIDC
  - API keys
  - JWT tokens
- Authorization:
  - Role-based access control (RBAC)
  - Resource-level permissions
- Network Security:
  - TLS/HTTPS (via ingress)
  - Network policies (Kubernetes)
  - VPC isolation (AWS)
- Data Security:
  - Encryption at rest (S3 SSE)
  - Encryption in transit (HTTPS)
  - Secrets management (Kubernetes Secrets, AWS Secrets Manager)
Deployment Architecture
Local Development
Docker Compose
├── PostgreSQL container
├── MinIO container
└── API container
Kubernetes Production
Kubernetes Cluster
├── Deployment (API pods)
├── Service (load balancer)
├── StatefulSet (PostgreSQL)
├── StatefulSet (MinIO)
├── Ingress (HTTPS termination)
└── Secrets (credentials)
AWS Production
AWS
├── EKS (API pods)
├── RDS PostgreSQL
├── S3 (blob storage)
├── ALB (load balancer)
└── Secrets Manager
Configuration Management
Environment Variables
- Centralized in app/config.py
- Loaded via Pydantic Settings
- Support for .env files
- Override via environment variables
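To illustrate the "defaults overridden by environment variables" pattern, here is a standard-library sketch that approximates what a Pydantic Settings class provides. It is not the project's `app/config.py`: the field names, default values (other than the `test-artifacts` bucket name used elsewhere in this document), and the `from_env` helper are all hypothetical, and .env file loading is left out.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical fields and defaults, for illustration only
    database_url: str = "postgresql://localhost:5432/artifacts"
    storage_backend: str = "minio"
    bucket_name: str = "test-artifacts"
    max_upload_mb: int = 500

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        """Build settings from defaults, overriding any field set in the environment."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(f.name.upper())
            if raw is not None:
                # Coerce the string to the type of the field's default (str or int here)
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)
```

For example, `Settings.from_env({"MAX_UPLOAD_MB": "10"})` keeps every default except `max_upload_mb`, which becomes the integer `10`. Pydantic Settings adds validation errors, .env parsing, and richer type coercion on top of this basic idea.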
Kubernetes ConfigMaps/Secrets
- Non-sensitive: ConfigMaps
- Sensitive: Secrets (base64-encoded, not encrypted by default)
- Mounted as environment variables
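Kubernetes stores Secret values base64-encoded, which is an encoding rather than encryption. The short round trip below, using a made-up credential, shows why anyone who can read the Secret object can recover the value.

```python
import base64

# Made-up credential for illustration
plaintext = b"s3cr3t-password"

# What would appear under `data:` in a Secret manifest
encoded = base64.b64encode(plaintext).decode("ascii")

# Decoding needs no key, so base64 provides no confidentiality
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == plaintext)
```

Access to Secrets should therefore be restricted through RBAC, and encryption at rest or an external secrets manager is worth considering for production clusters.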
Monitoring and Observability
Health Checks
- /health: Liveness probe
- Database connectivity check
- Storage backend connectivity check
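One way the health checks above could be combined into a single response is sketched below. The function name `run_health_checks`, the response shape, and the check names are assumptions; each check is any callable that raises on failure, such as a `SELECT 1` against the database or a bucket lookup against storage.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each named check and report per-dependency and overall status."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report the failure instead of letting one dependency crash the endpoint
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}
```

A handler for /health could return this dict with HTTP 200 when `status` is `healthy` and 503 otherwise, so that orchestrator probes see the failure.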
Logging
- Structured logging via Python logging
- JSON format for log aggregation
- Log levels: INFO, WARNING, ERROR
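Structured JSON logging can be done with the standard `logging` module alone. The sketch below is one possible setup; the `JsonFormatter` class, the `configure_logging` helper, and the chosen field names are assumptions rather than the project's actual configuration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback text when an exception is attached
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    """Send root logger output to stderr as JSON lines."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

After `configure_logging()`, a call such as `logging.getLogger("app.api").warning("upload failed for %s", "run.csv")` emits one JSON line that log aggregators can parse without custom patterns.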
Metrics (Future)
- Prometheus metrics endpoint
- Request count, latency, errors
- Storage usage, database connections
Disaster Recovery
Backup Strategy
- Database: pg_dump scheduled backups
- Storage: S3 versioning, cross-region replication
- Configuration: GitOps (Helm charts in Git)
Recovery Procedures
- Restore database from backup
- Storage: S3 objects remain available, so no separate restore step is needed
- Redeploy application via Helm
Future Enhancements
Performance
- Caching layer (Redis)
- CDN for frequently accessed files
- Database sharding for massive scale
Features
- File versioning UI
- Batch upload API
- Full-text search (Elasticsearch)
- File preview generation
- Webhooks for events
Operations
- Automated testing pipeline
- Blue-green deployments
- Canary releases
- Disaster recovery automation
Technology Choices Rationale
| Technology | Why? |
|---|---|
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry standard, scalable, S3-compatible |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |
Development Principles
- Separation of Concerns: Clear layers (API, models, storage)
- Abstraction: Storage backend abstraction for flexibility
- Configuration as Code: Helm charts, GitOps
- Testability: Dependency injection, mocking interfaces
- Observability: Logging, health checks, metrics
- Security: Secrets management, least privilege
- Scalability: Stateless design, horizontal scaling