# Architecture Overview

## System Design

The Test Artifact Data Lake is designed as a cloud-native, microservices-ready application that separates concerns between metadata storage and blob storage.

## Components

### 1. FastAPI Application (app/)

**Purpose**: RESTful API server handling all client requests

**Key Modules**:
- `app/main.py`: Application entry point, route registration
- `app/config.py`: Configuration management using Pydantic
- `app/database.py`: Database connection and session management

### 2. API Layer (app/api/)

**Purpose**: HTTP endpoint definitions and request handling

**Files**:
- `app/api/artifacts.py`: All artifact-related endpoints
  - Upload: Multipart file upload with metadata
  - Download: File retrieval with streaming
  - Query: Complex filtering and search
  - Delete: Cascade deletion from both DB and storage
  - Presigned URLs: Temporary download links

### 3. Models Layer (app/models/)

**Purpose**: SQLAlchemy ORM models for database tables

**Files**:
- `app/models/artifact.py`: Artifact model with all metadata fields
  - File information (name, type, size, path)
  - Test metadata (name, suite, config, result)
  - Custom metadata and tags
  - Versioning support
  - Timestamps

### 4. Schemas Layer (app/schemas/)

**Purpose**: Pydantic models for request/response validation

**Files**:
- `app/schemas/artifact.py`:
  - `ArtifactCreate`: Upload request validation
  - `ArtifactResponse`: API response serialization
  - `ArtifactQuery`: Query filtering parameters

### 5. Storage Layer (app/storage/)

**Purpose**: Abstraction over different blob storage backends

**Architecture**:
```
StorageBackend (Abstract Base Class)
├── S3Backend (AWS S3 implementation)
└── MinIOBackend (Self-hosted S3-compatible)
```

**Files**:
- `app/storage/base.py`: Abstract interface
- `app/storage/s3_backend.py`: AWS S3 implementation
- `app/storage/minio_backend.py`: MinIO implementation
- `app/storage/factory.py`: Backend selection logic

**Key Methods**:
- `upload_file()`: Store blob with unique path
- `download_file()`: Retrieve blob by path
- `delete_file()`: Remove blob from storage
- `file_exists()`: Check blob existence
- `get_file_url()`: Generate presigned download URL

## Data Flow

### Upload Flow

```
Client
  ↓ (multipart/form-data)
FastAPI Endpoint
  ↓ (parse metadata)
Validation Layer
  ↓ (generate UUID path)
Storage Backend
  ↓ (store blob)
Database
  ↓ (save metadata)
Response (artifact object)
```

### Query Flow

```
Client
  ↓ (JSON query)
FastAPI Endpoint
  ↓ (validate filters)
Database Query Builder
  ↓ (SQL with filters)
PostgreSQL
  ↓ (result set)
Response (artifact list)
```

### Download Flow

```
Client
  ↓ (GET request)
FastAPI Endpoint
  ↓ (lookup artifact)
Database
  ↓ (get storage path)
Storage Backend
  ↓ (retrieve blob)
StreamingResponse
  ↓ (binary data)
Client
```

## Database Schema

### Table: artifacts

| Column | Type | Description |
|--------|------|-------------|
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | csv, json, binary, pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | pass/fail/skip/error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |

**Indexes**:
- Primary: id
- Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id

## Storage Architecture

### Blob Storage

**S3/MinIO Bucket Structure**:
```
test-artifacts/
├── {uuid1}.csv
├── {uuid2}.json
├── {uuid3}.pcap
└── {uuid4}.bin
```

- Files stored with UUID-based names to prevent conflicts
- Original filenames preserved in database metadata
- No directory structure (flat namespace)

### Database vs Blob Storage

| Data Type | Storage |
|-----------|---------|
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |

## Scalability Considerations

### Horizontal Scaling

**API Layer**:
- Stateless FastAPI instances
- Can scale to N replicas
- Load balanced via Kubernetes Service

**Database**:
- PostgreSQL with read replicas
- Connection pooling
- Query optimization via indexes

**Storage**:
- S3: Effectively unlimited capacity
- MinIO: Can be clustered

### Performance Optimizations

1. **Streaming Uploads/Downloads**: Avoids loading entire files into memory
2. **Database Indexes**: Fast queries on common fields
3. **Presigned URLs**: Offload downloads to storage backend
4. **Async I/O**: FastAPI async endpoints for concurrent requests

## Security Architecture

### Current State (No Auth)

- API is open to all requests
- Suitable for internal networks only
- Add authentication middleware as needed

### Recommended Enhancements

1. **Authentication**:
   - OAuth 2.0 / OIDC
   - API keys
   - JWT tokens

2. **Authorization**:
   - Role-based access control (RBAC)
   - Resource-level permissions

3. **Network Security**:
   - TLS/HTTPS (via ingress)
   - Network policies (Kubernetes)
   - VPC isolation (AWS)

4. **Data Security**:
   - Encryption at rest (S3 SSE)
   - Encryption in transit (HTTPS)
   - Secrets management (Kubernetes Secrets, AWS Secrets Manager)

## Deployment Architecture

### Local Development

```
Docker Compose
├── PostgreSQL container
├── MinIO container
└── API container
```

### Kubernetes Production

```
Kubernetes Cluster
├── Deployment (API pods)
├── Service (load balancer)
├── StatefulSet (PostgreSQL)
├── StatefulSet (MinIO)
├── Ingress (HTTPS termination)
└── Secrets (credentials)
```

### AWS Production

```
AWS
├── EKS (API pods)
├── RDS PostgreSQL
├── S3 (blob storage)
├── ALB (load balancer)
└── Secrets Manager
```

## Configuration Management

### Environment Variables

- Centralized in `app/config.py`
- Loaded via Pydantic Settings
- Support for `.env` files
- Override via environment variables

### Kubernetes ConfigMaps/Secrets

- Non-sensitive: ConfigMaps
- Sensitive: Secrets (base64-encoded)
- Mounted as environment variables

## Monitoring and Observability

### Health Checks

- `/health`: Liveness probe
- Database connectivity check
- Storage backend connectivity check

### Logging

- Structured logging via Python logging
- JSON format for log aggregation
- Log levels: INFO, WARNING, ERROR

### Metrics (Future)

- Prometheus metrics endpoint
- Request count, latency, errors
- Storage usage, database connections

## Disaster Recovery

### Backup Strategy

1. **Database**: pg_dump scheduled backups
2. **Storage**: S3 versioning, cross-region replication
3. **Configuration**: GitOps (Helm charts in Git)

### Recovery Procedures

1. Restore database from backup
2. Storage automatically available (S3)
3. Redeploy application via Helm

## Future Enhancements

### Performance

- Caching layer (Redis)
- CDN for frequently accessed files
- Database sharding for massive scale

### Features

- File versioning UI
- Batch upload API
- Full-text search (Elasticsearch)
- File preview generation
- Webhooks for events

### Operations

- Automated testing pipeline
- Blue-green deployments
- Canary releases
- Disaster recovery automation

## Technology Choices Rationale

| Technology | Why? |
|------------|------|
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry standard, scalable, S3-compatible |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |

## Development Principles

1. **Separation of Concerns**: Clear layers (API, models, storage)
2. **Abstraction**: Storage backend abstraction for flexibility
3. **Configuration as Code**: Helm charts, GitOps
4. **Testability**: Dependency injection, mocking interfaces
5. **Observability**: Logging, health checks, metrics
6. **Security**: Secrets management, least privilege
7. **Scalability**: Stateless design, horizontal scaling
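
The abstraction and testability principles can be illustrated with a minimal sketch of the storage interface. The method names (`upload_file`, `download_file`, `delete_file`, `file_exists`, `get_file_url`) come from the Storage Layer section; the signatures, return types, and the `InMemoryBackend` class are assumptions made for illustration and may not match the actual code in `app/storage/`.

```python
import uuid
from abc import ABC, abstractmethod
from pathlib import PurePosixPath
from typing import Dict


class StorageBackend(ABC):
    """Interface sketch. Signatures are assumed, not taken from app/storage/base.py."""

    @abstractmethod
    def upload_file(self, data: bytes, original_filename: str) -> str:
        """Store a blob and return its storage path."""

    @abstractmethod
    def download_file(self, path: str) -> bytes:
        """Return the blob stored at path."""

    @abstractmethod
    def delete_file(self, path: str) -> bool:
        """Remove the blob; return True if something was deleted."""

    @abstractmethod
    def file_exists(self, path: str) -> bool:
        """Report whether a blob exists at path."""

    @abstractmethod
    def get_file_url(self, path: str, expiration: int = 3600) -> str:
        """Return a temporary download URL for the blob."""


class InMemoryBackend(StorageBackend):
    """Hypothetical test double: keeps blobs in a dict instead of S3/MinIO."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def upload_file(self, data: bytes, original_filename: str) -> str:
        # UUID-based name in a flat namespace; only the extension is kept.
        suffix = PurePosixPath(original_filename).suffix
        path = f"{uuid.uuid4()}{suffix}"
        self._blobs[path] = data
        return path

    def download_file(self, path: str) -> bytes:
        return self._blobs[path]

    def delete_file(self, path: str) -> bool:
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path: str) -> bool:
        return path in self._blobs

    def get_file_url(self, path: str, expiration: int = 3600) -> str:
        # Placeholder scheme for tests only; not a real presigned URL.
        return f"memory://{path}?expires_in={expiration}"
```

Because endpoints would depend only on the `StorageBackend` interface, a test could inject `InMemoryBackend()` in place of the S3 or MinIO implementation and exercise upload, download, and delete logic without any network access.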