2025-10-14 15:37:37 -05:00
commit 6821e717cd
39 changed files with 3346 additions and 0 deletions

# Deployment Guide
This guide covers deploying the Test Artifact Data Lake in various environments.
## Table of Contents
- [Local Development](#local-development)
- [Docker Compose](#docker-compose)
- [Kubernetes/Helm](#kuberneteshelm)
- [AWS Deployment](#aws-deployment)
- [Self-Hosted Deployment](#self-hosted-deployment)
- [GitLab CI/CD](#gitlab-cicd)
- [Monitoring](#monitoring)
- [Backup and Recovery](#backup-and-recovery)
- [Troubleshooting](#troubleshooting)
- [Security Considerations](#security-considerations)
- [Performance Tuning](#performance-tuning)
---
## Local Development
### Prerequisites
- Python 3.11+
- PostgreSQL 15+
- MinIO or AWS S3 access
### Steps
1. **Create virtual environment:**
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Set up PostgreSQL:**
```bash
createdb datalake
```
4. **Configure environment:**
```bash
cp .env.example .env
# Edit .env with your configuration
```
5. **Run the application:**
```bash
python -m uvicorn app.main:app --reload
```
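Step 4 leaves the contents of `.env` to the reader. As a minimal sketch, a small pre-flight check can confirm that the variables the application needs are set before uvicorn starts. `DATABASE_URL` appears later in this guide; `STORAGE_BACKEND` is a hypothetical name used only for illustration, so substitute the keys listed in `.env.example`.

```bash
# Pre-flight check: fail early when a required variable is missing or empty.
# The variable names passed in are assumptions; take the real ones from .env.example.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumes .env uses shell-compatible KEY=value lines):
#   set -a; . ./.env; set +a
#   require_vars DATABASE_URL STORAGE_BACKEND && python -m uvicorn app.main:app --reload
```

The function reports every missing name before returning, so a single run shows everything that still needs to be filled in.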
---
## Docker Compose
### Quick Start
1. **Start all services:**
```bash
docker-compose up -d
```
2. **Check logs:**
```bash
docker-compose logs -f api
```
3. **Stop services:**
```bash
docker-compose down
```
### Services Included
- PostgreSQL (port 5432)
- MinIO (port 9000, console 9001)
- API (port 8000)
### Customization
Edit `docker-compose.yml` to:
- Change port mappings
- Adjust resource limits
- Add environment variables
- Configure volumes
---
## Kubernetes/Helm
### Prerequisites
- Kubernetes cluster (1.24+)
- Helm 3.x
- kubectl configured
### Installation
1. **Add the Bitnami chart repository (only if PostgreSQL/MinIO are installed from Bitnami charts):**
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
```
2. **Install with default values:**
```bash
helm install datalake ./helm \
--namespace datalake \
--create-namespace
```
3. **Custom installation:**
```bash
helm install datalake ./helm \
--namespace datalake \
--create-namespace \
--set image.repository=your-registry/datalake \
--set image.tag=1.0.0 \
--set ingress.enabled=true \
--set 'ingress.hosts[0].host=datalake.yourdomain.com'
```
### Configuration Options
**Image:**
```bash
--set image.repository=your-registry/datalake
--set image.tag=1.0.0
--set image.pullPolicy=Always
```
**Resources:**
```bash
--set resources.requests.cpu=1000m
--set resources.requests.memory=1Gi
--set resources.limits.cpu=2000m
--set resources.limits.memory=2Gi
```
**Autoscaling:**
```bash
--set autoscaling.enabled=true
--set autoscaling.minReplicas=3
--set autoscaling.maxReplicas=10
--set autoscaling.targetCPUUtilizationPercentage=80
```
**Ingress:**
```bash
--set ingress.enabled=true
--set ingress.className=nginx
--set 'ingress.hosts[0].host=datalake.example.com'
--set 'ingress.hosts[0].paths[0].path=/'
--set 'ingress.hosts[0].paths[0].pathType=Prefix'
```
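When many of these options are combined, a values file is usually easier to review than a long chain of `--set` flags. The sketch below prints a file built only from the keys shown above; the file name `values-prod.yaml` and the helper name `print_values` are illustrative assumptions, and the chart's own `values.yaml` remains the authority on which keys exist.

```bash
# Print a Helm values file equivalent to the --set flags listed above.
# The host defaults to the example domain used in this guide.
print_values() {
  host=${1:-datalake.example.com}
  cat <<EOF
image:
  repository: your-registry/datalake
  tag: "1.0.0"
  pullPolicy: Always
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: $host
      paths:
        - path: /
          pathType: Prefix
EOF
}

# Usage:
#   print_values datalake.example.com > values-prod.yaml
#   helm install datalake ./helm --namespace datalake --create-namespace -f values-prod.yaml
```

Keeping the file in version control also records which settings each environment was deployed with.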
### Upgrade
```bash
helm upgrade datalake ./helm \
--namespace datalake \
--set image.tag=1.1.0
```
### Uninstall
```bash
helm uninstall datalake --namespace datalake
```
---
## AWS Deployment
### Using AWS S3 Storage
1. **Create S3 bucket:**
```bash
aws s3 mb s3://your-test-artifacts-bucket
```
2. **Create IAM user with S3 access:**
```bash
aws iam create-user --user-name datalake-service
aws iam attach-user-policy --user-name datalake-service \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```
3. **Generate access keys:**
```bash
aws iam create-access-key --user-name datalake-service
```
4. **Deploy with Helm:**
```bash
helm install datalake ./helm \
--namespace datalake \
--create-namespace \
--set config.storageBackend=s3 \
--set aws.enabled=true \
--set aws.accessKeyId=YOUR_ACCESS_KEY \
--set aws.secretAccessKey=YOUR_SECRET_KEY \
--set aws.region=us-east-1 \
--set aws.bucketName=your-test-artifacts-bucket \
--set minio.enabled=false
```
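`AmazonS3FullAccess` in step 2 grants access to every bucket in the account. A narrower alternative is an inline policy scoped to the one bucket. The sketch below assumes the service only needs to list, read, write, and delete objects; that action list and the policy name `datalake-s3-access` are assumptions, not something this guide specifies.

```bash
# Print an IAM policy document limited to a single bucket.
# The action list is an assumption about what the service needs.
print_bucket_policy() {
  bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::$bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::$bucket/*"
    }
  ]
}
EOF
}

# Usage (in place of the attach-user-policy command in step 2):
#   print_bucket_policy your-test-artifacts-bucket > policy.json
#   aws iam put-user-policy --user-name datalake-service \
#     --policy-name datalake-s3-access --policy-document file://policy.json
```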
### Using EKS
1. **Create EKS cluster:**
```bash
eksctl create cluster \
--name datalake-cluster \
--region us-east-1 \
--nodegroup-name standard-workers \
--node-type t3.medium \
--nodes 3
```
2. **Configure kubectl:**
```bash
aws eks update-kubeconfig --name datalake-cluster --region us-east-1
```
3. **Deploy application:**
```bash
helm install datalake ./helm \
--namespace datalake \
--create-namespace \
--set config.storageBackend=s3
```
### Using RDS for PostgreSQL
```bash
helm install datalake ./helm \
--namespace datalake \
--create-namespace \
--set postgresql.enabled=false \
--set config.databaseUrl="postgresql://user:pass@your-rds-endpoint:5432/datalake"
```
---
## Self-Hosted Deployment
### Using MinIO
1. **Deploy MinIO:**
```bash
helm install minio bitnami/minio \
--namespace datalake \
--create-namespace \
--set auth.rootUser=admin \
--set auth.rootPassword=adminpassword \
--set persistence.size=100Gi
```
2. **Deploy application:**
```bash
helm install datalake ./helm \
--namespace datalake \
--set config.storageBackend=minio \
--set minio.enabled=false \
--set minio.endpoint=minio:9000 \
--set minio.accessKey=admin \
--set minio.secretKey=adminpassword
```
### On-Premise Kubernetes
1. **Prepare persistent volumes:**
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datalake-postgres-pv
spec:
  storageClassName: local-storage  # must match the storageClass set in step 2
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/postgres
```
2. **Deploy with local storage:**
```bash
helm install datalake ./helm \
--namespace datalake \
--create-namespace \
--set postgresql.persistence.storageClass=local-storage \
--set minio.persistence.storageClass=local-storage
```
---
## GitLab CI/CD
### Setup
1. **Configure GitLab variables:**
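Go to Settings → CI/CD → Variables and add the following. `CI_REGISTRY_USER` and `CI_REGISTRY_PASSWORD` are predefined by GitLab when the project's built-in container registry is enabled, so define them only for an external registry: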
| Variable | Description | Protected | Masked |
|----------|-------------|-----------|---------|
| `CI_REGISTRY_USER` | Docker registry username | No | No |
| `CI_REGISTRY_PASSWORD` | Docker registry password | No | Yes |
| `KUBE_CONFIG_DEV` | Base64 kubeconfig for dev | No | Yes |
| `KUBE_CONFIG_STAGING` | Base64 kubeconfig for staging | Yes | Yes |
| `KUBE_CONFIG_PROD` | Base64 kubeconfig for prod | Yes | Yes |
2. **Encode kubeconfig:**
```bash
cat ~/.kube/config | base64 -w 0
```
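The `-w 0` flag is specific to GNU `base64`, so the command above can fail on systems that ship a different implementation. A portable sketch strips the line breaks with `tr` instead; the helper name `b64_oneline` is hypothetical.

```bash
# Encode stdin as single-line Base64 without relying on GNU `base64 -w 0`.
b64_oneline() {
  base64 | tr -d '\n'
}

# Usage: export only the current context, with credentials inlined, then encode.
#   kubectl config view --raw --minify --flatten | b64_oneline
```

Exporting a single context with `--minify` keeps credentials for unrelated clusters out of the CI variable.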
### Pipeline Stages
1. **Test**: Runs on all branches and MRs
2. **Build**: Builds Docker image on main/develop/tags
3. **Deploy**: Manual deployment to dev/staging/prod
### Deployment Flow
**Development:**
```bash
git push origin develop
# Manually trigger deploy:dev job in GitLab
```
**Staging:**
```bash
git push origin main
# Manually trigger deploy:staging job in GitLab
```
**Production:**
```bash
git tag v1.0.0
git push origin v1.0.0
# Manually trigger deploy:prod job in GitLab
```
### Customizing Pipeline
Edit `.gitlab-ci.yml` to:
- Add more test stages
- Change deployment namespaces
- Adjust Helm values per environment
- Add security scanning
- Configure rollback procedures
---
## Monitoring
### Health Checks
```bash
# Kubernetes
kubectl get pods -n datalake
kubectl logs -f -n datalake deployment/datalake
# Direct
curl http://localhost:8000/health
```
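In scripts it is often useful to wait until the health endpoint responds instead of checking once. The following is a minimal sketch that polls the `/health` URL shown above; the function name, the retry count, and the delay are arbitrary assumptions.

```bash
# Poll a health endpoint until it returns a success status or attempts run out.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_health() {
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -sS hides progress but keeps errors.
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_health http://localhost:8000/health 30 2
```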
### Metrics
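Add Prometheus monitoring to an existing release. The `ServiceMonitor` resource requires the Prometheus Operator CRDs to be installed in the cluster:
```bash
helm upgrade datalake ./helm \
--namespace datalake \
--reuse-values \
--set metrics.enabled=true \
--set serviceMonitor.enabled=true
```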
---
## Backup and Recovery
### Database Backup
```bash
# PostgreSQL
kubectl exec -n datalake deployment/datalake-postgresql -- \
pg_dump -U user datalake > backup.sql
# Restore
kubectl exec -i -n datalake deployment/datalake-postgresql -- \
psql -U user datalake < backup.sql
```
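A redirect such as `> backup.sql` creates the file even when `pg_dump` fails, which can leave an empty or partial backup behind. The sketch below wraps the same command with a timestamped file name and a basic sanity check; the function name and file naming scheme are assumptions.

```bash
# Dump the database to a timestamped file and reject failed or empty dumps.
# Argument: target directory (defaults to the current directory).
backup_database() {
  dir=${1:-.}
  out="$dir/datalake-$(date +%Y%m%d-%H%M%S).sql"
  if ! kubectl exec -n datalake deployment/datalake-postgresql -- \
      pg_dump -U user datalake > "$out"; then
    rm -f "$out"
    echo "pg_dump failed" >&2
    return 1
  fi
  if [ ! -s "$out" ]; then
    rm -f "$out"
    echo "backup is empty" >&2
    return 1
  fi
  # Print the path so callers can copy or upload the file.
  echo "$out"
}

# Usage:
#   backup_file=$(backup_database /var/backups) && echo "wrote $backup_file"
```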
### Storage Backup
**S3:**
```bash
aws s3 sync s3://your-bucket s3://backup-bucket
```
**MinIO:**
```bash
# Assumes the `minio` and `backup` aliases were defined with `mc alias set`
mc mirror minio/test-artifacts backup/test-artifacts
```
---
## Troubleshooting
### Pod Not Starting
```bash
kubectl describe pod -n datalake <pod-name>
kubectl logs -n datalake <pod-name>
```
### Database Connection Issues
```bash
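# Single quotes make $DATABASE_URL expand inside the pod, not in the local shell
# (assumes psql is installed in the application image)
kubectl exec -it -n datalake deployment/datalake -- \
sh -c 'psql "$DATABASE_URL"'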
```
### Storage Issues
```bash
# Check MinIO
kubectl port-forward -n datalake svc/minio 9000:9000
# The MinIO API is then reachable at http://localhost:9000 (the web console listens on 9001)
```
---
## Security Considerations
1. **Use secrets management:**
- Kubernetes Secrets
- AWS Secrets Manager
- HashiCorp Vault
2. **Enable TLS:**
- Configure ingress with TLS certificates
- Use cert-manager for automatic certificates
3. **Network policies:**
- Restrict pod-to-pod communication
- Limit external access
4. **RBAC:**
- Configure Kubernetes RBAC
- Limit service account permissions
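As one illustration of the first point, credentials can be stored in a Kubernetes Secret instead of being passed with `--set`, which keeps them out of shell history and `helm get values` output. This is a sketch under stated assumptions: the Secret name `datalake-storage` and its key names are hypothetical, and whether the chart can read credentials from an existing Secret must be checked in its `values.yaml`.

```bash
# Create or update a Secret holding S3 credentials read from the environment.
# Secret name and key names are hypothetical; match them to what the chart expects.
apply_storage_secret() {
  # Abort with a message if either credential is unset or empty.
  : "${AWS_ACCESS_KEY_ID:?must be set}" "${AWS_SECRET_ACCESS_KEY:?must be set}"
  # Render the manifest client-side, then apply it so reruns update the Secret.
  kubectl create secret generic datalake-storage \
    --namespace datalake \
    --from-literal=accessKeyId="$AWS_ACCESS_KEY_ID" \
    --from-literal=secretAccessKey="$AWS_SECRET_ACCESS_KEY" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Usage:
#   export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=...
#   apply_storage_secret
```

Piping the dry-run output into `kubectl apply` makes the command safe to rerun when credentials are rotated.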
---
## Performance Tuning
### Database
- Increase connection pool size
- Add database indexes
- Configure autovacuum
### API
- Increase replica count
- Configure horizontal pod autoscaling
- Adjust resource requests/limits
### Storage
- Use CDN for frequently accessed files
- Configure S3 Transfer Acceleration
- Optimize MinIO deployment