# Architecture Overview
## System Design
The Test Artifact Data Lake is a cloud-native, microservices-ready application that keeps artifact metadata in PostgreSQL and file content in S3-compatible blob storage (AWS S3 or MinIO).
## Components
### 1. FastAPI Application (app/)
**Purpose**: RESTful API server handling all client requests
**Key Modules**:
- `app/main.py`: Application entry point, route registration
- `app/config.py`: Configuration management using Pydantic
- `app/database.py`: Database connection and session management
### 2. API Layer (app/api/)
**Purpose**: HTTP endpoint definitions and request handling
**Files**:
- `app/api/artifacts.py`: All artifact-related endpoints
  - Upload: Multipart file upload with metadata
  - Download: File retrieval with streaming
  - Query: Complex filtering and search
  - Delete: Cascade deletion from both DB and storage
  - Presigned URLs: Temporary download links
### 3. Models Layer (app/models/)
**Purpose**: SQLAlchemy ORM models for database tables
**Files**:
- `app/models/artifact.py`: Artifact model with all metadata fields
  - File information (name, type, size, path)
  - Test metadata (name, suite, config, result)
  - Custom metadata and tags
  - Versioning support
  - Timestamps
### 4. Schemas Layer (app/schemas/)
**Purpose**: Pydantic models for request/response validation
**Files**:
- `app/schemas/artifact.py`:
  - `ArtifactCreate`: Upload request validation
  - `ArtifactResponse`: API response serialization
  - `ArtifactQuery`: Query filtering parameters
### 5. Storage Layer (app/storage/)
**Purpose**: Abstraction over different blob storage backends
**Architecture**:
```
StorageBackend (Abstract Base Class)
├── S3Backend (AWS S3 implementation)
└── MinIOBackend (Self-hosted S3-compatible)
```
**Files**:
- `app/storage/base.py`: Abstract interface
- `app/storage/s3_backend.py`: AWS S3 implementation
- `app/storage/minio_backend.py`: MinIO implementation
- `app/storage/factory.py`: Backend selection logic
**Key Methods**:
- `upload_file()`: Store blob with unique path
- `download_file()`: Retrieve blob by path
- `delete_file()`: Remove blob from storage
- `file_exists()`: Check blob existence
- `get_file_url()`: Generate presigned download URL
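
The interface above can be sketched as an abstract base class. This is a minimal illustration, not the contents of `app/storage/base.py`: the parameter names, return types, and the `InMemoryBackend` test double are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Interface mirroring the five key methods listed above."""

    @abstractmethod
    def upload_file(self, path: str, data: bytes) -> str: ...

    @abstractmethod
    def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, path: str) -> bool: ...

    @abstractmethod
    def file_exists(self, path: str) -> bool: ...

    @abstractmethod
    def get_file_url(self, path: str, expires_in: int = 3600) -> str: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical test double; not one of the project's real backends."""

    def __init__(self) -> None:
        self._blobs = {}

    def upload_file(self, path, data):
        self._blobs[path] = data
        return path

    def download_file(self, path):
        return self._blobs[path]  # raises KeyError if the blob is missing

    def delete_file(self, path):
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path):
        return path in self._blobs

    def get_file_url(self, path, expires_in=3600):
        # A real backend would return a presigned URL; this is a placeholder.
        return f"memory://{path}?expires_in={expires_in}"
```

Code that depends only on `StorageBackend` can be exercised in unit tests with a double like this, without a running S3 or MinIO instance.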
## Data Flow
### Upload Flow
```
Client
↓ (multipart/form-data)
FastAPI Endpoint
↓ (parse metadata)
Validation Layer
↓ (generate UUID path)
Storage Backend
↓ (store blob)
Database
↓ (save metadata)
Response (artifact object)
```
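
The ordering in this flow (blob first, metadata second) can be sketched as below. The `blobs` dict and `save_metadata` callable are hypothetical stand-ins for the storage backend and the database insert. The cleanup on failure is an assumption about desirable behavior, not something the source documents.

```python
import os
import uuid


def upload_artifact(blobs: dict, save_metadata, filename: str, data: bytes) -> dict:
    """Store the blob first, then the metadata record, as in the flow above."""
    ext = os.path.splitext(filename)[1]
    storage_path = f"{uuid.uuid4()}{ext}"  # generate UUID path
    blobs[storage_path] = data             # store blob
    record = {
        "filename": filename,
        "file_size": len(data),
        "storage_path": storage_path,
    }
    try:
        save_metadata(record)              # save metadata
    except Exception:
        # Compensate so a failed insert does not leave an orphaned blob.
        blobs.pop(storage_path, None)
        raise
    return record
```

Because the two stores do not share a transaction, some form of compensation or periodic orphan cleanup is usually needed whichever write goes first.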
### Query Flow
```
Client
↓ (JSON query)
FastAPI Endpoint
↓ (validate filters)
Database Query Builder
↓ (SQL with filters)
PostgreSQL
↓ (result set)
Response (artifact list)
```
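
A minimal sketch of the "validate filters, then build SQL" step follows. The function name, the allow-list, and the `?` placeholder style (SQLite's; PostgreSQL drivers typically use `%s`) are assumptions. The project itself may build queries through SQLAlchemy instead.

```python
ALLOWED_FILTERS = ("file_type", "test_name", "test_suite", "test_result")


def build_artifact_query(filters: dict, limit: int = 100):
    """Return (sql, params) for the given equality filters.

    Column names come from a fixed allow-list and values are bound as
    parameters, so user input is never interpolated into the SQL text.
    """
    clauses, params = [], []
    for column in ALLOWED_FILTERS:
        value = filters.get(column)
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT * FROM artifacts"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY created_at DESC LIMIT ?"
    params.append(limit)
    return sql, params
```

Unknown keys in `filters` are silently ignored here; rejecting them with a validation error would be an equally reasonable choice.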
### Download Flow
```
Client
↓ (GET request)
FastAPI Endpoint
↓ (lookup artifact)
Database
↓ (get storage path)
Storage Backend
↓ (retrieve blob)
StreamingResponse
↓ (binary data)
Client
```
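
The streaming step can be illustrated with a chunked generator. FastAPI's `StreamingResponse` accepts an iterator of bytes, so a generator of this shape could serve as its body. The function name and chunk size are assumptions for the sketch.

```python
import io
from typing import BinaryIO, Iterator


def iter_blob(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the blob in fixed-size chunks until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Example: a 10-byte blob read in 4-byte chunks.
example_chunks = list(iter_blob(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

Only one chunk is held in memory at a time, which is what keeps large artifacts from being loaded whole.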
## Database Schema
### Table: artifacts
| Column | Type | Description |
|--------|------|-------------|
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | csv, json, binary, pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | pass/fail/skip/error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |
**Indexes**:
- Primary: id
- Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id
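
For reference, the table and its secondary indexes can be written out as DDL. The sketch below runs against in-memory SQLite as a stand-in for PostgreSQL. The index names and the self-referencing foreign key on `parent_id` are assumptions, and nullability is left unspecified because the table above does not state it.

```python
import sqlite3

DDL = """
CREATE TABLE artifacts (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    filename     VARCHAR(500),
    file_type    VARCHAR(50),
    file_size    BIGINT,
    storage_path VARCHAR(1000),
    content_type VARCHAR(100),
    test_name    VARCHAR(500),
    test_suite   VARCHAR(500),
    test_config  JSON,
    test_result  VARCHAR(50),
    metadata     JSON,
    description  TEXT,
    tags         JSON,
    created_at   DATETIME,
    updated_at   DATETIME,
    version      VARCHAR(50),
    parent_id    INTEGER REFERENCES artifacts(id)  -- assumed self-reference
);
"""

INDEXED = ["filename", "file_type", "test_name", "test_suite",
           "test_result", "created_at", "parent_id"]


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the artifacts table plus one secondary index per indexed column."""
    conn.executescript(DDL)
    for col in INDEXED:
        conn.execute(f"CREATE INDEX ix_artifacts_{col} ON artifacts ({col})")
```

If the model is declared with SQLAlchemy's Declarative API, note that `metadata` is a reserved attribute name on mapped classes, so the Python attribute would need a different name mapped onto this column.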
## Storage Architecture
### Blob Storage
**S3/MinIO Bucket Structure**:
```
test-artifacts/
├── {uuid1}.csv
├── {uuid2}.json
├── {uuid3}.pcap
└── {uuid4}.bin
```
- Files stored with UUID-based names to prevent conflicts
- Original filenames preserved in database metadata
- No directory structure (flat namespace)
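
A UUID-based object key of the form shown above could be generated as follows. The helper name and the choice to lower-case the extension are assumptions. How extensionless files map to `.bin` is not specified in the source and is not handled here.

```python
import os
import uuid


def make_storage_key(original_filename: str) -> str:
    """Return a flat object key such as '<uuid>.csv'.

    Only the extension of the original name is kept; any directory part is
    dropped, so keys never contain '/' and the namespace stays flat.
    """
    ext = os.path.splitext(original_filename)[1].lower()
    return f"{uuid.uuid4()}{ext}"
```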
### Database vs Blob Storage
| Data Type | Storage |
|-----------|---------|
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |
## Scalability Considerations
### Horizontal Scaling
**API Layer**:
- Stateless FastAPI instances
- Can scale to N replicas
- Load balanced via Kubernetes Service
**Database**:
- PostgreSQL with read replicas
- Connection pooling
- Query optimization via indexes
**Storage**:
- S3: No practical capacity limit; scaling is handled by AWS
- MinIO: Can be clustered
### Performance Optimizations
1. **Streaming Uploads/Downloads**: Avoids loading entire files into memory
2. **Database Indexes**: Fast queries on common fields
3. **Presigned URLs**: Offload downloads to storage backend
4. **Async I/O**: FastAPI async endpoints for concurrent requests
## Security Architecture
### Current State (No Auth)
- API is open to all requests
- Suitable for internal networks
- Add authentication middleware as needed
### Recommended Enhancements
1. **Authentication**:
   - OAuth 2.0 / OIDC
   - API keys
   - JWT tokens
2. **Authorization**:
   - Role-based access control (RBAC)
   - Resource-level permissions
3. **Network Security**:
   - TLS/HTTPS (via ingress)
   - Network policies (Kubernetes)
   - VPC isolation (AWS)
4. **Data Security**:
   - Encryption at rest (S3 SSE)
   - Encryption in transit (HTTPS)
   - Secrets management (Kubernetes Secrets, AWS Secrets Manager)
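
As one illustration of the API-key option, the check below compares a presented key against stored SHA-256 digests using a constant-time comparison. The function names and the storage format are hypothetical. A real deployment would wire this into FastAPI middleware or a dependency, which is not shown.

```python
import hashlib
import hmac


def hash_key(api_key: str) -> str:
    """Return the hex SHA-256 digest stored in place of the raw key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def is_valid_api_key(presented: str, stored_hashes: set) -> bool:
    """Check a presented key against stored digests.

    Every stored digest is compared with hmac.compare_digest so that the
    time taken does not reveal how many leading characters matched.
    """
    digest = hash_key(presented)
    matches = [hmac.compare_digest(digest, known) for known in stored_hashes]
    return any(matches)
```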
## Deployment Architecture
### Local Development
```
Docker Compose
├── PostgreSQL container
├── MinIO container
└── API container
```
### Kubernetes Production
```
Kubernetes Cluster
├── Deployment (API pods)
├── Service (load balancer)
├── StatefulSet (PostgreSQL)
├── StatefulSet (MinIO)
├── Ingress (HTTPS termination)
└── Secrets (credentials)
```
### AWS Production
```
AWS
├── EKS (API pods)
├── RDS PostgreSQL
├── S3 (blob storage)
├── ALB (load balancer)
└── Secrets Manager
```
## Configuration Management
### Environment Variables
- Centralized in `app/config.py`
- Loaded via Pydantic Settings
- Support for `.env` files
- Override via environment variables
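
The override behavior can be illustrated with a standard-library stand-in for Pydantic Settings. This is not the project's `app/config.py`: the field names and defaults are invented for the example, and `.env` file loading is omitted.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Field names and defaults are illustrative, not the project's real ones.
    database_url: str = "postgresql://localhost/artifacts"
    storage_backend: str = "minio"
    bucket_name: str = "test-artifacts"


def load_settings(environ=None) -> Settings:
    """Build Settings, letting upper-cased environment variables override defaults."""
    environ = os.environ if environ is None else environ
    overrides = {
        f.name: environ[f.name.upper()]
        for f in fields(Settings)
        if f.name.upper() in environ
    }
    return Settings(**overrides)
```

Passing a plain dict as `environ` keeps the loader easy to test without touching the real process environment.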
### Kubernetes ConfigMaps/Secrets
- Non-sensitive: ConfigMaps
- Sensitive: Secrets (base64-encoded, which is not encryption)
- Mounted as environment variables
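
Values in a Secret manifest are base64-encoded, which anyone can reverse without a key. The short sketch below makes that concrete. The helper names are illustrative only.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears under `data:` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key is needed, so this is not encryption."""
    return base64.b64decode(encoded).decode("utf-8")
```

For this reason, access to Secrets is normally restricted through RBAC, and manifests containing them are kept out of plain Git history.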
## Monitoring and Observability
### Health Checks
- `/health`: Liveness probe
- Database connectivity check
- Storage backend connectivity check
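
One way to combine the database and storage checks into a single health report is sketched below. The function name and response shape are assumptions. The individual checks are passed in as callables that raise on failure.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}
```

An endpoint built on this would typically return HTTP 200 when `status` is `healthy` and 503 otherwise, so that Kubernetes probes can act on it.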
### Logging
- Structured logging via Python logging
- JSON format for log aggregation
- Log levels: INFO, WARNING, ERROR
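
JSON-formatted output can be produced with the standard `logging` module by subclassing `logging.Formatter`. The class name and the chosen field names are assumptions. The project may use a dedicated library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    """Replace the root logger's handlers with one JSON stream handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

Calling `configure_logging()` once at startup is enough; module-level loggers obtained with `logging.getLogger(__name__)` then inherit the JSON handler.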
### Metrics (Future)
- Prometheus metrics endpoint
- Request count, latency, errors
- Storage usage, database connections
## Disaster Recovery
### Backup Strategy
1. **Database**: pg_dump scheduled backups
2. **Storage**: S3 versioning, cross-region replication
3. **Configuration**: GitOps (Helm charts in Git)
### Recovery Procedures
1. Restore database from backup
2. Storage automatically available (S3)
3. Redeploy application via Helm
## Future Enhancements
### Performance
- Caching layer (Redis)
- CDN for frequently accessed files
- Database sharding for massive scale
### Features
- File versioning UI
- Batch upload API
- Full-text search (Elasticsearch)
- File preview generation
- Webhooks for events
### Operations
- Automated testing pipeline
- Blue-green deployments
- Canary releases
- Disaster recovery automation
## Technology Choices Rationale
| Technology | Why? |
|------------|------|
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry standard, scalable, S3-compatible |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |
## Development Principles
1. **Separation of Concerns**: Clear layers (API, models, storage)
2. **Abstraction**: Storage backend abstraction for flexibility
3. **Configuration as Code**: Helm charts, GitOps
4. **Testability**: Dependency injection, mocking interfaces
5. **Observability**: Logging, health checks, metrics
6. **Security**: Secrets management, least privilege
7. **Scalability**: Stateless design, horizontal scaling