# Architecture Overview

## System Design

The Test Artifact Data Lake is a cloud-native, microservices-ready application that separates metadata storage (PostgreSQL) from blob storage (S3 or MinIO).

## Components

### 1. FastAPI Application (app/)

**Purpose**: RESTful API server handling all client requests

**Key Modules**:
- `app/main.py`: Application entry point, route registration
- `app/config.py`: Configuration management using Pydantic
- `app/database.py`: Database connection and session management

### 2. API Layer (app/api/)

**Purpose**: HTTP endpoint definitions and request handling

**Files**:
- `app/api/artifacts.py`: All artifact-related endpoints
  - Upload: Multipart file upload with metadata
  - Download: File retrieval with streaming
  - Query: Complex filtering and search
  - Delete: Cascade deletion from both DB and storage
  - Presigned URLs: Temporary download links

### 3. Models Layer (app/models/)

**Purpose**: SQLAlchemy ORM models for database tables

**Files**:
- `app/models/artifact.py`: Artifact model with all metadata fields
  - File information (name, type, size, path)
  - Test metadata (name, suite, config, result)
  - Custom metadata and tags
  - Versioning support
  - Timestamps

### 4. Schemas Layer (app/schemas/)

**Purpose**: Pydantic models for request/response validation

**Files**:
- `app/schemas/artifact.py`:
  - `ArtifactCreate`: Upload request validation
  - `ArtifactResponse`: API response serialization
  - `ArtifactQuery`: Query filtering parameters

### 5. Storage Layer (app/storage/)

**Purpose**: Abstraction over different blob storage backends

**Architecture**:
```
StorageBackend (Abstract Base Class)
├── S3Backend (AWS S3 implementation)
└── MinIOBackend (Self-hosted S3-compatible)
```

**Files**:
- `app/storage/base.py`: Abstract interface
- `app/storage/s3_backend.py`: AWS S3 implementation
- `app/storage/minio_backend.py`: MinIO implementation
- `app/storage/factory.py`: Backend selection logic

**Key Methods**:
- `upload_file()`: Store blob with unique path
- `download_file()`: Retrieve blob by path
- `delete_file()`: Remove blob from storage
- `file_exists()`: Check blob existence
- `get_file_url()`: Generate presigned download URL

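The storage abstraction might look roughly like the sketch below. The method names come from the Key Methods list; the signatures, the `InMemoryBackend` class, and the `get_backend` factory are assumptions made for illustration, not the project's actual code. The real backends would wrap an S3 or MinIO client instead of a dictionary.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Abstract interface. Signatures are assumed, not taken from the project."""

    @abstractmethod
    def upload_file(self, path: str, data: bytes) -> str: ...

    @abstractmethod
    def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, path: str) -> bool: ...

    @abstractmethod
    def file_exists(self, path: str) -> bool: ...

    @abstractmethod
    def get_file_url(self, path: str, expiration: int = 3600) -> str: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical stand-in for S3Backend/MinIOBackend, handy in unit tests."""

    def __init__(self) -> None:
        self._blobs = {}

    def upload_file(self, path, data):
        self._blobs[path] = data
        return path

    def download_file(self, path):
        return self._blobs[path]  # raises KeyError if the blob is missing

    def delete_file(self, path):
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path):
        return path in self._blobs

    def get_file_url(self, path, expiration=3600):
        # A real backend would return a presigned URL; this is a placeholder.
        return f"memory://{path}?expires_in={expiration}"


def get_backend(name: str) -> StorageBackend:
    """Hypothetical factory: map a configured name to a backend instance."""
    backends = {"memory": InMemoryBackend}
    try:
        return backends[name]()
    except KeyError:
        raise ValueError(f"Unsupported storage backend: {name}") from None
```

Because callers depend only on `StorageBackend`, swapping S3 for MinIO (or for an in-memory fake in tests) should not require changes in the API layer.
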
## Data Flow

### Upload Flow

```
Client
↓ (multipart/form-data)
FastAPI Endpoint
↓ (parse metadata)
Validation Layer
↓ (generate UUID path)
Storage Backend
↓ (store blob)
Database
↓ (save metadata)
Response (artifact object)
```

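The upload steps can be sketched without any framework. In this illustration a dictionary stands in for the storage backend and a list stands in for the `artifacts` table; the function name `handle_upload`, its parameters, and the cleanup-on-failure policy are assumptions, not the project's implementation.

```python
import uuid
from pathlib import PurePosixPath

# Allowed values taken from the test_result column description.
ALLOWED_RESULTS = {"pass", "fail", "skip", "error"}


def handle_upload(filename, data, metadata, storage, db):
    """Validate, store the blob, then record metadata (illustrative only)."""
    # 1. Validation layer: reject bad input before touching storage.
    result = metadata.get("test_result")
    if result is not None and result not in ALLOWED_RESULTS:
        raise ValueError(f"invalid test_result: {result!r}")

    # 2. Generate a UUID-based storage path, keeping the original extension.
    suffix = PurePosixPath(filename).suffix
    storage_path = f"{uuid.uuid4()}{suffix}"

    # 3. Storage backend: store the blob (a dict stands in for S3/MinIO).
    storage[storage_path] = data

    # 4. Database: save metadata (a list stands in for the artifacts table).
    record = {
        "id": len(db) + 1,
        "filename": filename,
        "file_size": len(data),
        "storage_path": storage_path,
        **metadata,
    }
    try:
        db.append(record)
    except Exception:
        # Assumed policy: remove the blob if the metadata write fails,
        # so storage does not accumulate orphaned objects.
        storage.pop(storage_path, None)
        raise

    # 5. Response: the artifact object.
    return record
```

Writing the blob before the metadata row means a failed upload never leaves a database record pointing at a missing object.
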
### Query Flow

```
Client
↓ (JSON query)
FastAPI Endpoint
↓ (validate filters)
Database Query Builder
↓ (SQL with filters)
PostgreSQL
↓ (result set)
Response (artifact list)
```

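The query builder step can be illustrated with a small function that turns validated filters into parameterized SQL. This is a sketch under assumptions: `build_query` and `FILTERABLE` are hypothetical names, and the standard-library `sqlite3` module stands in for PostgreSQL so the example runs anywhere (PostgreSQL drivers use a different placeholder style than `?`).

```python
import sqlite3

# Columns a client may filter on, taken from the indexed columns of the schema.
FILTERABLE = ("filename", "file_type", "test_name", "test_suite", "test_result")


def build_query(filters, limit=100, offset=0):
    """Return (sql, params); filter keys outside FILTERABLE are ignored."""
    clauses, params = [], []
    for column in FILTERABLE:
        value = filters.get(column)
        if value is not None:
            # Column names come from the allowlist; values are bound parameters.
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (
        "SELECT id, filename, test_result FROM artifacts"
        f"{where} ORDER BY created_at DESC LIMIT ? OFFSET ?"
    )
    return sql, params + [limit, offset]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (id INTEGER PRIMARY KEY, filename TEXT, file_type TEXT,"
    " test_name TEXT, test_suite TEXT, test_result TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO artifacts (filename, file_type, test_suite, test_result, created_at)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        ("a.csv", "csv", "smoke", "pass", "2024-01-01"),
        ("b.json", "json", "smoke", "fail", "2024-01-02"),
        ("c.pcap", "pcap", "regression", "fail", "2024-01-03"),
    ],
)

sql, params = build_query({"test_suite": "smoke", "test_result": "fail"})
rows = conn.execute(sql, params).fetchall()
```

Binding values as parameters, and taking column names only from an allowlist, keeps user-supplied filters out of the SQL text.
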
### Download Flow

```
Client
↓ (GET request)
FastAPI Endpoint
↓ (lookup artifact)
Database
↓ (get storage path)
Storage Backend
↓ (retrieve blob)
StreamingResponse
↓ (binary data)
Client
```

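The streaming step relies on reading the blob in fixed-size chunks rather than all at once. A minimal sketch of such a chunk generator follows; `iter_chunks` and the 64 KiB chunk size are assumptions for illustration. FastAPI's `StreamingResponse` accepts a generator of bytes, so a function like this could feed it.

```python
import io


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield successive chunks from a binary file-like object until EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# A 150,000-byte in-memory blob stands in for an object read from S3/MinIO.
blob = io.BytesIO(b"x" * 150_000)
sizes = [len(chunk) for chunk in iter_chunks(blob)]
print(sizes)  # [65536, 65536, 18928]
```

Memory use stays bounded by the chunk size regardless of how large the artifact is.
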
## Database Schema

### Table: artifacts

| Column | Type | Description |
|--------|------|-------------|
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | csv, json, binary, pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | pass/fail/skip/error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |

**Indexes**:
- Primary: id
- Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id

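As a rough illustration, the table and its secondary indexes could be expressed in plain SQL as below. This sketch uses `sqlite3` so it runs without a PostgreSQL server; the JSON columns are stored as `TEXT` here, the index names are made up, nullability is not specified in the table above and is left open, and the self-referencing foreign key on `parent_id` is an assumption.

```python
import sqlite3

DDL = """
CREATE TABLE artifacts (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    filename     VARCHAR(500),
    file_type    VARCHAR(50),
    file_size    BIGINT,
    storage_path VARCHAR(1000),
    content_type VARCHAR(100),
    test_name    VARCHAR(500),
    test_suite   VARCHAR(500),
    test_config  TEXT,          -- JSON in PostgreSQL
    test_result  VARCHAR(50),
    metadata     TEXT,          -- JSON in PostgreSQL
    description  TEXT,
    tags         TEXT,          -- JSON array in PostgreSQL
    created_at   TIMESTAMP,
    updated_at   TIMESTAMP,
    version      VARCHAR(50),
    parent_id    INTEGER REFERENCES artifacts (id)  -- assumed self-reference
);
"""

INDEXED_COLUMNS = [
    "filename", "file_type", "test_name", "test_suite",
    "test_result", "created_at", "parent_id",
]

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
for column in INDEXED_COLUMNS:
    # Hypothetical index naming scheme: ix_artifacts_<column>
    conn.execute(f"CREATE INDEX ix_artifacts_{column} ON artifacts ({column})")

# Each PRAGMA index_list row is (seq, name, unique, origin, partial).
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list('artifacts')"))
```

One detail worth checking in the ORM layer: SQLAlchemy's declarative base reserves the attribute name `metadata`, so a model for this table would need to map the `metadata` column under a different attribute name.
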
## Storage Architecture

### Blob Storage

**S3/MinIO Bucket Structure**:
```
test-artifacts/
├── {uuid1}.csv
├── {uuid2}.json
├── {uuid3}.pcap
└── {uuid4}.bin
```

- Files stored with UUID-based names to prevent conflicts
- Original filenames preserved in database metadata
- No directory structure (flat namespace)

### Database vs Blob Storage

| Data Type | Storage |
|-----------|---------|
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |

## Scalability Considerations

### Horizontal Scaling

**API Layer**:
- Stateless FastAPI instances
- Can scale to N replicas
- Load balanced via Kubernetes Service

**Database**:
- PostgreSQL with read replicas
- Connection pooling
- Query optimization via indexes

**Storage**:
- S3: Effectively unlimited capacity (managed by AWS)
- MinIO: Can be clustered

### Performance Optimizations

1. **Streaming Uploads/Downloads**: Avoids loading entire files into memory
2. **Database Indexes**: Fast queries on common fields
3. **Presigned URLs**: Offload downloads to storage backend
4. **Async I/O**: FastAPI async endpoints for concurrent requests

## Security Architecture

### Current State (No Auth)
- API is open to all requests
- Suitable only for trusted internal networks
- Add authentication middleware before exposing the API more widely

### Recommended Enhancements

1. **Authentication**:
   - OAuth 2.0 / OIDC
   - API keys
   - JWT tokens

2. **Authorization**:
   - Role-based access control (RBAC)
   - Resource-level permissions

3. **Network Security**:
   - TLS/HTTPS (via ingress)
   - Network policies (Kubernetes)
   - VPC isolation (AWS)

4. **Data Security**:
   - Encryption at rest (S3 SSE)
   - Encryption in transit (HTTPS)
   - Secrets management (Kubernetes Secrets, AWS Secrets Manager)

## Deployment Architecture

### Local Development
```
Docker Compose
├── PostgreSQL container
├── MinIO container
└── API container
```

### Kubernetes Production
```
Kubernetes Cluster
├── Deployment (API pods)
├── Service (load balancer)
├── StatefulSet (PostgreSQL)
├── StatefulSet (MinIO)
├── Ingress (HTTPS termination)
└── Secrets (credentials)
```

### AWS Production
```
AWS
├── EKS (API pods)
├── RDS PostgreSQL
├── S3 (blob storage)
├── ALB (load balancer)
└── Secrets Manager
```

## Configuration Management

### Environment Variables
- Centralized in `app/config.py`
- Loaded via Pydantic Settings
- Support for `.env` files
- Override via environment variables

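The pattern of defaults that environment variables override can be sketched with the standard library alone. The project itself uses Pydantic Settings, which adds type validation and `.env` loading; the `Settings` fields, their defaults, and the `from_env` helper below are hypothetical and only illustrate the idea.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Field names and defaults are hypothetical, not the project's settings.
    database_url: str = "postgresql://localhost:5432/artifacts"
    storage_backend: str = "minio"
    bucket_name: str = "test-artifacts"

    @classmethod
    def from_env(cls, env=None):
        """Build settings, letting upper-cased variable names override defaults."""
        env = os.environ if env is None else env
        overrides = {}
        for field in fields(cls):
            key = field.name.upper()  # e.g. storage_backend -> STORAGE_BACKEND
            if key in env:
                overrides[field.name] = env[key]
        return cls(**overrides)


# Passing a dict instead of os.environ keeps the example deterministic.
settings = Settings.from_env({"STORAGE_BACKEND": "s3"})
print(settings.storage_backend, settings.bucket_name)  # s3 test-artifacts
```

Keeping every setting in one object means the rest of the application never reads `os.environ` directly, which also makes configuration easy to override in tests.
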
### Kubernetes ConfigMaps/Secrets
- Non-sensitive: ConfigMaps
- Sensitive: Secrets (base64-encoded, which is not encryption)
- Injected as environment variables

## Monitoring and Observability

### Health Checks
- `/health`: Liveness probe
- Database connectivity check
- Storage backend connectivity check

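One way to combine the individual connectivity checks into a single health report is sketched below. The function `run_health_checks`, the response shape, and the two example checks are assumptions; the real endpoint may report health differently.

```python
def run_health_checks(checks):
    """Run each named check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}


def check_database():
    # Placeholder: a real check might run "SELECT 1" against PostgreSQL.
    pass


def check_storage():
    # Placeholder that simulates an unreachable bucket.
    raise ConnectionError("bucket unreachable")


report = run_health_checks({"database": check_database, "storage": check_storage})
```

Reporting each dependency separately makes it clear from the response which backend is failing.
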
### Logging
- Structured logging via Python logging
- JSON format for log aggregation
- Log levels: INFO, WARNING, ERROR

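JSON-formatted output can be produced with the standard `logging` module by supplying a custom formatter. The sketch below is one possible setup; the `JsonFormatter` class, the chosen field names, and the logger name `artifact_api` are assumptions rather than the project's configuration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Log to an in-memory stream so the output can be inspected in this example.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("artifact_api")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("artifact uploaded: %s", "run.csv")
entry = json.loads(stream.getvalue())
```

One JSON object per line is easy for log aggregators to parse without custom patterns.
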
### Metrics (Future)
- Prometheus metrics endpoint
- Request count, latency, errors
- Storage usage, database connections

## Disaster Recovery

### Backup Strategy
1. **Database**: Scheduled `pg_dump` backups
2. **Storage**: S3 versioning, cross-region replication
3. **Configuration**: GitOps (Helm charts in Git)

### Recovery Procedures
1. Restore database from backup
2. Storage needs no restore step on S3 (objects remain available)
3. Redeploy application via Helm

## Future Enhancements

### Performance
- Caching layer (Redis)
- CDN for frequently accessed files
- Database sharding for massive scale

### Features
- File versioning UI
- Batch upload API
- Full-text search (Elasticsearch)
- File preview generation
- Webhooks for events

### Operations
- Automated testing pipeline
- Blue-green deployments
- Canary releases
- Disaster recovery automation

## Technology Choices Rationale

| Technology | Why? |
|------------|------|
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry standard, scalable, S3-compatible |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |

## Development Principles

1. **Separation of Concerns**: Clear layers (API, models, storage)
2. **Abstraction**: Storage backend abstraction for flexibility
3. **Configuration as Code**: Helm charts, GitOps
4. **Testability**: Dependency injection, mocking interfaces
5. **Observability**: Logging, health checks, metrics
6. **Security**: Secrets management, least privilege
7. **Scalability**: Stateless design, horizontal scaling