# Architecture Overview

## System Design

The Test Artifact Data Lake is a cloud-native, microservices-ready application that keeps artifact metadata (PostgreSQL) separate from blob content (S3 or MinIO).

## Components

### 1. FastAPI Application (app/)

**Purpose**: RESTful API server handling all client requests

**Key Modules**:
- `app/main.py`: Application entry point, route registration
- `app/config.py`: Configuration management using Pydantic
- `app/database.py`: Database connection and session management

### 2. API Layer (app/api/)

**Purpose**: HTTP endpoint definitions and request handling

**Files**:
- `app/api/artifacts.py`: All artifact-related endpoints
  - Upload: Multipart file upload with metadata
  - Download: File retrieval with streaming
  - Query: Complex filtering and search
  - Delete: Cascade deletion from both DB and storage
  - Presigned URLs: Temporary download links

### 3. Models Layer (app/models/)

**Purpose**: SQLAlchemy ORM models for database tables

**Files**:
- `app/models/artifact.py`: Artifact model with all metadata fields
  - File information (name, type, size, path)
  - Test metadata (name, suite, config, result)
  - Custom metadata and tags
  - Versioning support
  - Timestamps

### 4. Schemas Layer (app/schemas/)

**Purpose**: Pydantic models for request/response validation

**Files**:
- `app/schemas/artifact.py`:
  - `ArtifactCreate`: Upload request validation
  - `ArtifactResponse`: API response serialization
  - `ArtifactQuery`: Query filtering parameters

### 5. Storage Layer (app/storage/)

**Purpose**: Abstraction over different blob storage backends

**Architecture**:
```
StorageBackend (Abstract Base Class)
    ├── S3Backend (AWS S3 implementation)
    └── MinIOBackend (Self-hosted S3-compatible)
```

**Files**:
- `app/storage/base.py`: Abstract interface
- `app/storage/s3_backend.py`: AWS S3 implementation
- `app/storage/minio_backend.py`: MinIO implementation
- `app/storage/factory.py`: Backend selection logic

**Key Methods**:
- `upload_file()`: Store blob with unique path
- `download_file()`: Retrieve blob by path
- `delete_file()`: Remove blob from storage
- `file_exists()`: Check blob existence
- `get_file_url()`: Generate presigned download URL
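
The interface above could look roughly like the following minimal sketch. The method names come from the list above, but the parameter names, return types, and the `InMemoryBackend` test double are assumptions made for illustration and may not match `app/storage/base.py`.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Interface each blob storage backend implements (signatures are assumed)."""

    @abstractmethod
    def upload_file(self, data: bytes, path: str) -> str: ...

    @abstractmethod
    def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    def delete_file(self, path: str) -> bool: ...

    @abstractmethod
    def file_exists(self, path: str) -> bool: ...

    @abstractmethod
    def get_file_url(self, path: str, expires_in: int = 3600) -> str: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical test double; not one of the real S3 or MinIO backends."""

    def __init__(self) -> None:
        self._blobs = {}  # maps storage path -> blob bytes

    def upload_file(self, data: bytes, path: str) -> str:
        self._blobs[path] = data
        return path

    def download_file(self, path: str) -> bytes:
        return self._blobs[path]

    def delete_file(self, path: str) -> bool:
        # True only if a blob was actually removed
        return self._blobs.pop(path, None) is not None

    def file_exists(self, path: str) -> bool:
        return path in self._blobs

    def get_file_url(self, path: str, expires_in: int = 3600) -> str:
        # Placeholder URL; real backends would return a presigned link
        return f"memory://{path}?expires_in={expires_in}"
```

Because callers depend only on `StorageBackend`, a double like this can replace S3 or MinIO in unit tests without network access.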

## Data Flow

### Upload Flow

```
Client
  ↓ (multipart/form-data)
FastAPI Endpoint
  ↓ (parse metadata)
Validation Layer
  ↓ (generate UUID path)
Storage Backend
  ↓ (store blob)
Database
  ↓ (save metadata)
Response (artifact object)
```
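
The upload steps can be sketched in plain Python as follows. This is an illustration only: `make_storage_path` and `upload_artifact` are hypothetical names, and a dict and a list stand in for the storage backend and the database.

```python
import uuid
from pathlib import PurePosixPath


def make_storage_path(filename: str) -> str:
    """Build a flat, UUID-based object key that keeps the original extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    return f"{uuid.uuid4()}{suffix}"


def upload_artifact(filename: str, data: bytes, blobs: dict, rows: list) -> dict:
    """Store the blob first, then record its metadata."""
    storage_path = make_storage_path(filename)
    blobs[storage_path] = data  # stand-in for the storage backend
    row = {
        "id": len(rows) + 1,  # stand-in for the auto-increment primary key
        "filename": filename,  # original name is kept only in metadata
        "file_size": len(data),
        "storage_path": storage_path,
    }
    rows.append(row)  # stand-in for the database insert
    return row
```

Writing the blob before the metadata row means a failed database write can leave an orphaned blob, so a real implementation would likely delete the blob on that error path.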

### Query Flow

```
Client
  ↓ (JSON query)
FastAPI Endpoint
  ↓ (validate filters)
Database Query Builder
  ↓ (SQL with filters)
PostgreSQL
  ↓ (result set)
Response (artifact list)
```
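
The "validate filters, then build SQL" steps might be sketched like this. The project presumably builds its queries through SQLAlchemy; this stand-alone version uses a hypothetical `build_artifact_query` helper and `?` placeholders (the SQLite style) only to show the idea of whitelisting columns and binding values as parameters.

```python
# Columns a client may filter on; taken from the indexed columns of the schema.
ALLOWED_FILTERS = ("file_type", "test_name", "test_suite", "test_result")


def build_artifact_query(filters: dict, limit: int = 100) -> tuple:
    """Return a parameterized SQL string and its bind values."""
    clauses, params = [], []
    for column in ALLOWED_FILTERS:
        value = filters.get(column)
        if value is not None:
            # Column names come from the whitelist, values are bound parameters
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT * FROM artifacts"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY created_at DESC LIMIT ?"
    params.append(limit)
    return sql, params
```

Keys that are not in `ALLOWED_FILTERS` are ignored, so client input never reaches the SQL text itself.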

### Download Flow

```
Client
  ↓ (GET request)
FastAPI Endpoint
  ↓ (lookup artifact)
Database
  ↓ (get storage path)
Storage Backend
  ↓ (retrieve blob)
StreamingResponse
  ↓ (binary data)
Client
```
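
The streaming step can be illustrated with a small generator. `iter_chunks` and the 64 KiB default are assumptions for this sketch, not names taken from the codebase.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield a blob piece by piece so the whole file never sits in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


# Usage with an in-memory stream standing in for the storage backend's response
chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

A generator of this shape can be passed to FastAPI's `StreamingResponse`, which sends each chunk to the client as it is produced.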

## Database Schema

### Table: artifacts

| Column | Type | Description |
|--------|------|-------------|
| id | Integer | Primary key (auto-increment) |
| filename | String(500) | Original filename (indexed) |
| file_type | String(50) | csv, json, binary, pcap (indexed) |
| file_size | BigInteger | File size in bytes |
| storage_path | String(1000) | Full storage path/URL |
| content_type | String(100) | MIME type |
| test_name | String(500) | Test identifier (indexed) |
| test_suite | String(500) | Suite identifier (indexed) |
| test_config | JSON | Test configuration object |
| test_result | String(50) | pass/fail/skip/error (indexed) |
| metadata | JSON | Custom metadata object |
| description | Text | Human-readable description |
| tags | JSON | Array of tags for categorization |
| created_at | DateTime | Creation timestamp (indexed) |
| updated_at | DateTime | Last update timestamp |
| version | String(50) | Version identifier |
| parent_id | Integer | Parent artifact ID (indexed) |

**Indexes**:
- Primary: id
- Secondary: filename, file_type, test_name, test_suite, test_result, created_at, parent_id
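
As a rough illustration of how secondary indexes and `parent_id` fit together, the sketch below creates a subset of the table in an in-memory SQLite database. SQLite is only a stand-in for PostgreSQL here, and the index names, the self-referencing foreign key, and the `v1`/`v2` version values are assumptions.

```python
import sqlite3

# Subset of the artifacts columns; SQLite types stand in for the PostgreSQL ones.
SCHEMA = """
CREATE TABLE artifacts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT,
    file_type TEXT,
    test_result TEXT,
    version TEXT,
    parent_id INTEGER REFERENCES artifacts(id)  -- assumed self-reference
);
CREATE INDEX ix_artifacts_file_type ON artifacts (file_type);
CREATE INDEX ix_artifacts_test_result ON artifacts (test_result);
CREATE INDEX ix_artifacts_parent_id ON artifacts (parent_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert a first version, then a second version that points back to it.
first = conn.execute(
    "INSERT INTO artifacts (filename, file_type, test_result, version) "
    "VALUES (?, ?, ?, ?)",
    ("run.csv", "csv", "fail", "v1"),
)
conn.execute(
    "INSERT INTO artifacts (filename, file_type, test_result, version, parent_id) "
    "VALUES (?, ?, ?, ?, ?)",
    ("run.csv", "csv", "pass", "v2", first.lastrowid),
)

# The parent_id index serves lookups of an artifact's later versions.
children = conn.execute(
    "SELECT version, test_result FROM artifacts WHERE parent_id = ?",
    (first.lastrowid,),
).fetchall()
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('artifacts')")
)
```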

## Storage Architecture

### Blob Storage

**S3/MinIO Bucket Structure**:
```
test-artifacts/
  ├── {uuid1}.csv
  ├── {uuid2}.json
  ├── {uuid3}.pcap
  └── {uuid4}.bin
```

- Files stored with UUID-based names to prevent conflicts
- Original filenames preserved in database metadata
- No directory structure (flat namespace)

### Database vs Blob Storage

| Data Type | Storage |
|-----------|---------|
| File content | S3/MinIO |
| Metadata | PostgreSQL |
| Test configs | PostgreSQL (JSON) |
| Custom metadata | PostgreSQL (JSON) |
| Tags | PostgreSQL (JSON array) |
| File paths | PostgreSQL |

## Scalability Considerations

### Horizontal Scaling

**API Layer**:
- Stateless FastAPI instances
- Can scale to N replicas
- Load balanced via Kubernetes Service

**Database**:
- PostgreSQL with read replicas
- Connection pooling
- Query optimization via indexes

**Storage**:
- S3: Effectively unlimited capacity, scaled by AWS
- MinIO: Can be clustered

### Performance Optimizations

1. **Streaming Uploads/Downloads**: Avoids loading entire files into memory
2. **Database Indexes**: Fast queries on common fields
3. **Presigned URLs**: Offload downloads to storage backend
4. **Async I/O**: FastAPI async endpoints for concurrent requests

## Security Architecture

### Current State (No Auth)
- The API accepts all requests without authentication
- Suitable only for trusted internal networks
- Add authentication middleware before exposing it more widely

### Recommended Enhancements

1. **Authentication**:
   - OAuth 2.0 / OIDC
   - API keys
   - JWT tokens

2. **Authorization**:
   - Role-based access control (RBAC)
   - Resource-level permissions

3. **Network Security**:
   - TLS/HTTPS (via ingress)
   - Network policies (Kubernetes)
   - VPC isolation (AWS)

4. **Data Security**:
   - Encryption at rest (S3 SSE)
   - Encryption in transit (HTTPS)
   - Secrets management (Kubernetes Secrets, AWS Secrets Manager)
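
As one possible starting point for the API key option, the check itself can be as small as the sketch below. The `x-api-key` header name and the `is_authorized` helper are hypothetical, and the headers mapping is assumed to use lowercase keys.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Check an API key header using a constant-time comparison."""
    supplied = headers.get("x-api-key")
    if not supplied or not expected_key:
        # Reject missing keys and refuse to run with an empty configured key
        return False
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

In a FastAPI application a check like this would typically live in a dependency or middleware, with the expected key loaded from a secret rather than from source code.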

## Deployment Architecture

### Local Development
```
Docker Compose
  ├── PostgreSQL container
  ├── MinIO container
  └── API container
```

### Kubernetes Production
```
Kubernetes Cluster
  ├── Deployment (API pods)
  ├── Service (load balancer)
  ├── StatefulSet (PostgreSQL)
  ├── StatefulSet (MinIO)
  ├── Ingress (HTTPS termination)
  └── Secrets (credentials)
```

### AWS Production
```
AWS
  ├── EKS (API pods)
  ├── RDS PostgreSQL
  ├── S3 (blob storage)
  ├── ALB (load balancer)
  └── Secrets Manager
```

## Configuration Management

### Environment Variables
- Centralized in `app/config.py`
- Loaded via Pydantic Settings
- Support for `.env` files
- Override via environment variables
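
The override behaviour can be sketched with the standard library alone. The project uses Pydantic Settings, so this dataclass is a simplified stand-in rather than the real `app/config.py`; the variable names and defaults are assumptions, and `.env` loading is not shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Central place for configuration values (field names are hypothetical)."""

    database_url: str
    storage_backend: str
    bucket_name: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Each value falls back to a default when the variable is unset
        return cls(
            database_url=env.get("DATABASE_URL", "postgresql://localhost/artifacts"),
            storage_backend=env.get("STORAGE_BACKEND", "minio").lower(),
            bucket_name=env.get("BUCKET_NAME", "test-artifacts"),
        )


# Passing a mapping explicitly keeps the example independent of the real environment
settings = Settings.from_env({"STORAGE_BACKEND": "S3"})
```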

### Kubernetes ConfigMaps/Secrets
- Non-sensitive: ConfigMaps
- Sensitive: Secrets (values are base64-encoded, which is an encoding and not encryption)
- Mounted as environment variables
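
Kubernetes stores Secret values base64-encoded in the manifest. The short sketch below, using a made-up value, shows that anyone who can read the manifest can recover the original, which is why access to Secrets should be restricted.

```python
import base64

# Made-up credential used purely for illustration
plain = b"example-db-password"

# What would appear under `data:` in a Secret manifest
encoded = base64.b64encode(plain).decode("ascii")

# Decoding requires no key, so base64 offers no confidentiality
decoded = base64.b64decode(encoded)
```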

## Monitoring and Observability

### Health Checks
- `/health`: Liveness probe
- Database connectivity check
- Storage backend connectivity check
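
A health endpoint that aggregates these checks might be structured like the following sketch. `run_health_checks` and the response shape are assumptions; each check is a callable that raises on failure.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> dict:
    """Run each dependency check and aggregate the results."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure marks the dependency unhealthy
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}
```

The `/health` handler could return this dict with HTTP 200 when healthy and 503 otherwise, so the probe fails when a dependency is unreachable.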

### Logging
- Structured logging via Python logging
- JSON format for log aggregation
- Log levels: INFO, WARNING, ERROR
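
JSON-formatted logs can be produced with the standard `logging` module alone. The sketch below is one way to do it; the `JsonFormatter` class and the field names are assumptions, not the project's actual logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback text when an exception was logged
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    """Install the JSON formatter on the root logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=level, handlers=[handler], force=True)
```

One object per line keeps the output easy to parse for log aggregators.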

### Metrics (Future)
- Prometheus metrics endpoint
- Request count, latency, errors
- Storage usage, database connections

## Disaster Recovery

### Backup Strategy
1. **Database**: pg_dump scheduled backups
2. **Storage**: S3 versioning, cross-region replication
3. **Configuration**: GitOps (Helm charts in Git)

### Recovery Procedures
1. Restore database from backup
2. Verify blob storage: S3 objects remain available without a restore step, and versioning can recover overwritten or deleted files
3. Redeploy application via Helm

## Future Enhancements

### Performance
- Caching layer (Redis)
- CDN for frequently accessed files
- Database sharding for massive scale

### Features
- File versioning UI
- Batch upload API
- Full-text search (Elasticsearch)
- File preview generation
- Webhooks for events

### Operations
- Automated testing pipeline
- Blue-green deployments
- Canary releases
- Disaster recovery automation

## Technology Choices Rationale

| Technology | Why? |
|------------|------|
| FastAPI | Modern, fast, auto-generated docs, async support |
| PostgreSQL | Reliable, JSON support, strong indexing |
| S3/MinIO | Industry-standard object storage API, scalable, MinIO offers a self-hosted S3-compatible option |
| SQLAlchemy | Powerful ORM, migration support |
| Pydantic | Type safety, validation, settings management |
| Docker | Containerization, portability |
| Kubernetes/Helm | Orchestration, declarative deployment |
| GitLab CI | Integrated CI/CD, container registry |

## Development Principles

1. **Separation of Concerns**: Clear layers (API, models, storage)
2. **Abstraction**: Storage backend abstraction for flexibility
3. **Configuration as Code**: Helm charts, GitOps
4. **Testability**: Dependency injection, mocking interfaces
5. **Observability**: Logging, health checks, metrics
6. **Security**: Secrets management, least privilege
7. **Scalability**: Stateless design, horizontal scaling