init
This commit is contained in:
347
ARCHITECTURE.md
Normal file
@@ -0,0 +1,347 @@
# Architecture Overview

## System Design

The Test Artifact Data Lake is a cloud-native, microservices-ready application that keeps artifact metadata (in PostgreSQL) separate from file content (in S3 or MinIO blob storage).

## Components

### 1. FastAPI Application (app/)

**Purpose**: RESTful API server handling all client requests

**Key Modules**:
- `app/main.py`: Application entry point, route registration
- `app/config.py`: Configuration management using Pydantic
- `app/database.py`: Database connection and session management

### 2. API Layer (app/api/)

**Purpose**: HTTP endpoint definitions and request handling

**Files**:
- `app/api/artifacts.py`: All artifact-related endpoints
  - Upload: Multipart file upload with metadata
  - Download: File retrieval with streaming
  - Query: Complex filtering and search
  - Delete: Cascade deletion from both DB and storage
  - Presigned URLs: Temporary download links

### 3. Models Layer (app/models/)

**Purpose**: SQLAlchemy ORM models for database tables

**Files**:
- `app/models/artifact.py`: Artifact model with all metadata fields
  - File information (name, type, size, path)
  - Test metadata (name, suite, config, result)
  - Custom metadata and tags
  - Versioning support
  - Timestamps

### 4. Schemas Layer (app/schemas/)

**Purpose**: Pydantic models for request/response validation

**Files**:
- `app/schemas/artifact.py`:
  - `ArtifactCreate`: Upload request validation
  - `ArtifactResponse`: API response serialization
  - `ArtifactQuery`: Query filtering parameters

### 5. Storage Layer (app/storage/)

**Purpose**: Abstraction over different blob storage backends

**Architecture**:
```
StorageBackend (Abstract Base Class)
├── S3Backend (AWS S3 implementation)
└── MinIOBackend (Self-hosted S3-compatible)
```

**Files**:
- `app/storage/base.py`: Abstract interface
- `app/storage/s3_backend.py`: AWS S3 implementation
- `app/storage/minio_backend.py`: MinIO implementation
- `app/storage/factory.py`: Backend selection logic

**Key Methods**:
- `upload_file()`: Store blob with unique path
- `download_file()`: Retrieve blob by path
- `delete_file()`: Remove blob from storage
- `file_exists()`: Check blob existence
- `get_file_url()`: Generate presigned download URL

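
The interface listed under Key Methods might look roughly like the sketch below. The method names come from the list above; the signatures, the `InMemoryBackend` test double, and the `get_storage_backend` factory are assumptions made for illustration and are not taken from the project's code.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Interface every blob backend implements (signatures are assumed)."""

    @abstractmethod
    def upload_file(self, path: str, data: bytes) -> str: ...

    @abstractmethod
    def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, path: str) -> bool: ...

    @abstractmethod
    def file_exists(self, path: str) -> bool: ...

    @abstractmethod
    def get_file_url(self, path: str, expiration: int = 3600) -> str: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical dict-backed backend, handy as a test double."""

    def __init__(self) -> None:
        self._blobs = {}

    def upload_file(self, path: str, data: bytes) -> str:
        self._blobs[path] = data
        return path

    def download_file(self, path: str) -> bytes:
        return self._blobs[path]

    def delete_file(self, path: str) -> bool:
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path: str) -> bool:
        return path in self._blobs

    def get_file_url(self, path: str, expiration: int = 3600) -> str:
        # Not a real presigned URL; only marks where one would be returned.
        return f"memory://{path}?expires_in={expiration}"


def get_storage_backend(name: str) -> StorageBackend:
    """Factory sketch: map a configured backend name to an implementation."""
    backends = {"memory": InMemoryBackend}
    try:
        return backends[name]()
    except KeyError:
        raise ValueError(f"Unknown storage backend: {name}") from None
```

Because callers depend only on `StorageBackend`, a test can swap in the in-memory double without touching S3 or MinIO.
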
## Data Flow

### Upload Flow

```
Client
  ↓ (multipart/form-data)
FastAPI Endpoint
  ↓ (parse metadata)
Validation Layer
  ↓ (generate UUID path)
Storage Backend
  ↓ (store blob)
Database
  ↓ (save metadata)
Response (artifact object)
```

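
The upload steps above can be sketched in plain Python. This is a simplified illustration, not the project's endpoint: a dict stands in for the storage backend, a list stands in for the database, and `ArtifactRecord` and `handle_upload` are hypothetical names. Removing the blob when the metadata write fails is one possible way to keep the two stores consistent; the source does not say how the project handles that case.

```python
import uuid
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ArtifactRecord:
    """Hypothetical stand-in for a row in the artifacts table."""
    id: int
    filename: str
    file_size: int
    storage_path: str


def handle_upload(filename: str, data: bytes, blobs: dict, rows: list) -> ArtifactRecord:
    # 1. Validate the parsed input before touching storage.
    if not filename or not data:
        raise ValueError("filename and file content are required")
    # 2. Generate a UUID-based storage path, keeping the original extension.
    storage_path = f"{uuid.uuid4()}{PurePosixPath(filename).suffix}"
    # 3. Store the blob (a dict stands in for the storage backend).
    blobs[storage_path] = data
    # 4. Save the metadata (a list stands in for the database). If that fails,
    #    remove the blob so storage and database do not drift apart.
    try:
        record = ArtifactRecord(len(rows) + 1, filename, len(data), storage_path)
        rows.append(record)
    except Exception:
        blobs.pop(storage_path, None)
        raise
    # 5. Return the artifact object to the client.
    return record
```
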
### Query Flow

```
Client
  ↓ (JSON query)
FastAPI Endpoint
  ↓ (validate filters)
Database Query Builder
  ↓ (SQL with filters)
PostgreSQL
  ↓ (result set)
Response (artifact list)
```

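
A rough sketch of the "validate filters, then build SQL" steps follows. It uses the standard-library `sqlite3` module in place of PostgreSQL so that it runs without a server; the `ALLOWED_FILTERS` set and the `build_query` helper are assumptions for illustration, and the project itself presumably builds queries through SQLAlchemy.

```python
import sqlite3

# Columns a client may filter on (an assumed subset of the indexed columns).
ALLOWED_FILTERS = {"file_type", "test_name", "test_suite", "test_result"}


def build_query(filters: dict):
    """Build a parameterized SELECT from validated filters."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"Unsupported filters: {sorted(unknown)}")
    sql = "SELECT id, filename FROM artifacts"
    params = []
    if filters:
        # Column names come from the allow-list; values are bound as parameters.
        columns = sorted(filters)
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in columns)
        params = [filters[column] for column in columns]
    return sql + " ORDER BY id", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (id INTEGER PRIMARY KEY, filename TEXT, "
    "file_type TEXT, test_name TEXT, test_suite TEXT, test_result TEXT)"
)
conn.executemany(
    "INSERT INTO artifacts (filename, file_type, test_suite, test_result) "
    "VALUES (?, ?, ?, ?)",
    [("a.csv", "csv", "smoke", "pass"), ("b.pcap", "pcap", "smoke", "fail")],
)
sql, params = build_query({"test_suite": "smoke", "test_result": "fail"})
rows = conn.execute(sql, params).fetchall()
```

Binding values as parameters rather than formatting them into the SQL string keeps client-supplied filter values from being interpreted as SQL.
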
### Download Flow

```
Client
  ↓ (GET request)
FastAPI Endpoint
  ↓ (lookup artifact)
Database
  ↓ (get storage path)
Storage Backend
  ↓ (retrieve blob)
StreamingResponse
  ↓ (binary data)
Client
```

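
The streaming step can be illustrated with a small generator. FastAPI's `StreamingResponse` accepts an iterator of bytes, so a generator along these lines could feed it; the `iter_chunks` name and the 64 KiB chunk size are assumptions, and `io.BytesIO` stands in for whatever object the storage backend returns.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(blob: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield a blob in fixed-size chunks so it is never fully held in memory."""
    while True:
        chunk = blob.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Usage: io.BytesIO stands in for the object returned by the storage backend.
chunks = list(iter_chunks(io.BytesIO(b"x" * 150_000)))
```
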
## Database Schema

### Table: artifacts

| Column | Type | Description |
|--------|------|-------------|
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | File type: csv, json, binary, or pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | Test outcome: pass, fail, skip, or error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |

**Indexes**:
- Primary: id
- Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id

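
For orientation, the table and its secondary indexes could be created as in the sketch below. It uses SQLite through the standard library so it runs anywhere; the column types are therefore SQLite approximations of the PostgreSQL types listed above, and the index names and the self-referencing foreign key on `parent_id` are assumptions. In the project the schema is defined by the SQLAlchemy model rather than by hand-written DDL.

```python
import sqlite3

INDEXED_COLUMNS = [
    "filename", "file_type", "test_name", "test_suite",
    "test_result", "created_at", "parent_id",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE artifacts (
        id           INTEGER PRIMARY KEY AUTOINCREMENT,
        filename     TEXT,
        file_type    TEXT,
        file_size    INTEGER,
        storage_path TEXT,
        content_type TEXT,
        test_name    TEXT,
        test_suite   TEXT,
        test_config  TEXT,   -- JSON serialized as text in this sketch
        test_result  TEXT,
        metadata     TEXT,   -- JSON serialized as text in this sketch
        description  TEXT,
        tags         TEXT,   -- JSON array serialized as text in this sketch
        created_at   TEXT DEFAULT CURRENT_TIMESTAMP,
        updated_at   TEXT,
        version      TEXT,
        parent_id    INTEGER REFERENCES artifacts (id)  -- assumed self-reference
    )
    """
)
# One secondary index per column listed under "Indexes" above.
for column in INDEXED_COLUMNS:
    conn.execute(f"CREATE INDEX ix_artifacts_{column} ON artifacts ({column})")

index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(artifacts)"))
```
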
## Storage Architecture

### Blob Storage

**S3/MinIO Bucket Structure**:
```
test-artifacts/
├── {uuid1}.csv
├── {uuid2}.json
├── {uuid3}.pcap
└── {uuid4}.bin
```

- Files stored with UUID-based names to prevent conflicts
- Original filenames preserved in database metadata
- No directory structure (flat namespace)

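
A minimal sketch of the naming scheme, assuming the object key is a random UUID followed by the original file extension. The `make_object_key` helper is hypothetical, and lowercasing the extension is a choice made here, not something the source specifies.

```python
import uuid
from pathlib import PurePosixPath


def make_object_key(original_filename: str) -> str:
    """Return a flat, UUID-based object key that keeps the file extension."""
    suffix = PurePosixPath(original_filename).suffix.lower()
    return f"{uuid.uuid4()}{suffix}"


# Two uploads with the same original name get different keys.
key_a = make_object_key("capture.PCAP")
key_b = make_object_key("capture.PCAP")
```

The original filename is not recoverable from the key, which is why it has to be kept in the `filename` column.
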
### Database vs Blob Storage

| Data Type | Storage |
|-----------|---------|
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |

## Scalability Considerations

### Horizontal Scaling

**API Layer**:
- Stateless FastAPI instances
- Can scale to N replicas
- Load balanced via Kubernetes Service

**Database**:
- PostgreSQL with read replicas
- Connection pooling
- Query optimization via indexes

**Storage**:
- S3: Managed service; capacity scales without action on the application side
- MinIO: Can be clustered

### Performance Optimizations

1. **Streaming Uploads/Downloads**: Avoids loading entire files into memory
2. **Database Indexes**: Fast queries on commonly filtered fields
3. **Presigned URLs**: Let clients download directly from the storage backend instead of through the API
4. **Async I/O**: FastAPI async endpoints for concurrent requests

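
The idea behind a presigned URL can be shown with a simplified HMAC scheme: the URL carries an expiry time and a signature that only the storage service can produce, so the service can serve the file without consulting the API. This is a conceptual illustration only. Real S3 and MinIO presigned URLs use AWS Signature Version 4 and are generated by the storage SDK; the secret, host name, and function names below are all made up.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # assumption: a key known only to the storage service


def _signature(path: str, expires: int) -> str:
    return hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(path: str, expires_in: int, now: Optional[float] = None) -> str:
    """Return a URL carrying an expiry time and an HMAC over path + expiry."""
    expires = int((time.time() if now is None else now) + expires_in)
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"https://storage.example.com/{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    path = parsed.path.lstrip("/")
    expires = int(query["expires"][0])
    current = time.time() if now is None else now
    valid = hmac.compare_digest(_signature(path, expires), query["sig"][0])
    return valid and current < expires
```
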
## Security Architecture

### Current State (No Auth)
- API is open to all requests
- Suitable only for trusted internal networks
- Add authentication middleware before exposing the API more widely

### Recommended Enhancements

1. **Authentication**:
   - OAuth 2.0 / OIDC
   - API keys
   - JWT tokens

2. **Authorization**:
   - Role-based access control (RBAC)
   - Resource-level permissions

3. **Network Security**:
   - TLS/HTTPS (via ingress)
   - Network policies (Kubernetes)
   - VPC isolation (AWS)

4. **Data Security**:
   - Encryption at rest (S3 SSE)
   - Encryption in transit (HTTPS)
   - Secrets management (Kubernetes Secrets, AWS Secrets Manager)

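
As a sketch of the simplest option above, API keys, the check below compares a presented key against known keys using `hmac.compare_digest`, which is designed to avoid leaking information through comparison timing. The `X-API-Key` header name, the key table, and the `authenticate` function are assumptions; in a real deployment the keys would come from a secrets store and the check would run as FastAPI middleware or a dependency.

```python
import hmac
from typing import Mapping, Optional

# Assumption: in practice these would be loaded from a secrets store.
VALID_API_KEYS = {"ci-runner": "s3cr3t-ci-key"}


def authenticate(headers: Mapping[str, str]) -> Optional[str]:
    """Return the client name for a valid X-API-Key header, else None."""
    presented = headers.get("X-API-Key", "")
    for client, key in VALID_API_KEYS.items():
        # compare_digest avoids short-circuiting on the first differing byte.
        if hmac.compare_digest(presented.encode(), key.encode()):
            return client
    return None
```
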
## Deployment Architecture

### Local Development
```
Docker Compose
├── PostgreSQL container
├── MinIO container
└── API container
```

### Kubernetes Production
```
Kubernetes Cluster
├── Deployment (API pods)
├── Service (load balancer)
├── StatefulSet (PostgreSQL)
├── StatefulSet (MinIO)
├── Ingress (HTTPS termination)
└── Secrets (credentials)
```

### AWS Production
```
AWS
├── EKS (API pods)
├── RDS PostgreSQL
├── S3 (blob storage)
├── ALB (load balancer)
└── Secrets Manager
```

## Configuration Management

### Environment Variables
- Centralized in `app/config.py`
- Loaded via Pydantic Settings
- Support for `.env` files
- Environment variables override values from `.env` files and coded defaults

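
The override behavior can be sketched without Pydantic. The standard-library version below is a stand-in for what Pydantic Settings does in `app/config.py`: each field has a coded default, and an environment variable replaces it when set. The field names, environment variable names, and defaults are illustrative assumptions; only the `test-artifacts` bucket name appears elsewhere in this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings; field names and defaults are illustrative."""
    database_url: str = "postgresql://localhost:5432/artifacts"
    storage_backend: str = "minio"
    bucket_name: str = "test-artifacts"

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        env = os.environ if env is None else env
        # An environment variable, when set, overrides the coded default.
        return cls(
            database_url=env.get("DATABASE_URL", cls.database_url),
            storage_backend=env.get("STORAGE_BACKEND", cls.storage_backend),
            bucket_name=env.get("BUCKET_NAME", cls.bucket_name),
        )


settings = Settings.from_env({"STORAGE_BACKEND": "s3"})
```
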
### Kubernetes ConfigMaps/Secrets
- Non-sensitive: ConfigMaps
- Sensitive: Secrets (base64-encoded, which is an encoding and not encryption)
- Exposed to the application as environment variables

## Monitoring and Observability

### Health Checks
- `/health`: Liveness probe
- Database connectivity check
- Storage backend connectivity check

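
One way to combine these checks into a single health report is sketched below. The `run_health_checks` helper and the response shape are assumptions, not the project's actual `/health` payload; each check is a callable that raises on failure.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each check; a check signals failure by raising an exception."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}


def failing_storage_check() -> None:
    # Simulates an unreachable bucket for the usage example.
    raise ConnectionError("bucket unreachable")


report = run_health_checks({"database": lambda: None, "storage": failing_storage_check})
```
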
### Logging
- Structured logging via Python logging
- JSON format for log aggregation
- Log levels: INFO, WARNING, ERROR

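
JSON-formatted logs can be produced with the standard `logging` module alone, as in the sketch below. The `JsonFormatter` class, the chosen fields, and the `artifact_api` logger name are assumptions; the project may well use a dedicated library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("artifact_api")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("artifact uploaded: %s", "results.csv")
entry = json.loads(stream.getvalue())
```
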
### Metrics (Future)
- Prometheus metrics endpoint
- Request count, latency, errors
- Storage usage, database connections

## Disaster Recovery

### Backup Strategy
1. **Database**: Scheduled `pg_dump` backups
2. **Storage**: S3 versioning, cross-region replication
3. **Configuration**: GitOps (Helm charts in Git)

### Recovery Procedures
1. Restore database from backup
2. Storage needs no restore step on S3; objects remain available
3. Redeploy application via Helm

## Future Enhancements

### Performance
- Caching layer (Redis)
- CDN for frequently accessed files
- Database sharding for massive scale

### Features
- File versioning UI
- Batch upload API
- Full-text search (Elasticsearch)
- File preview generation
- Webhooks for events

### Operations
- Automated testing pipeline
- Blue-green deployments
- Canary releases
- Disaster recovery automation

## Technology Choices Rationale

| Technology | Why? |
|------------|------|
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry standard, scalable, S3-compatible |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |

## Development Principles

1. **Separation of Concerns**: Clear layers (API, models, storage)
2. **Abstraction**: Storage backend abstraction for flexibility
3. **Configuration as Code**: Helm charts, GitOps
4. **Testability**: Dependency injection, mocking interfaces
5. **Observability**: Logging, health checks, metrics
6. **Security**: Secrets management, least privilege
7. **Scalability**: Stateless design, horizontal scaling