# Deployment Guide

This guide covers deploying the Test Artifact Data Lake in various environments.

## Table of Contents

- [Local Development](#local-development)
- [Docker Compose](#docker-compose)
- [Kubernetes/Helm](#kuberneteshelm)
- [AWS Deployment](#aws-deployment)
- [Self-Hosted Deployment](#self-hosted-deployment)
- [GitLab CI/CD](#gitlab-cicd)
- [Monitoring](#monitoring)
- [Backup and Recovery](#backup-and-recovery)
- [Troubleshooting](#troubleshooting)
- [Security Considerations](#security-considerations)
- [Performance Tuning](#performance-tuning)

---

## Local Development

### Prerequisites

- Python 3.11+
- PostgreSQL 15+
- MinIO or AWS S3 access

### Steps

1. **Create virtual environment:**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up PostgreSQL:**

   ```bash
   createdb datalake
   ```

4. **Configure environment:**

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

5. **Run the application:**

   ```bash
   python -m uvicorn app.main:app --reload
   ```

---

## Docker Compose

### Quick Start

1. **Start all services:**

   ```bash
   docker-compose up -d
   ```

2. **Check logs:**

   ```bash
   docker-compose logs -f api
   ```

3. **Stop services:**

   ```bash
   docker-compose down
   ```

### Services Included

- PostgreSQL (port 5432)
- MinIO (port 9000, console 9001)
- API (port 8000)

### Customization

Edit `docker-compose.yml` to:

- Change port mappings
- Adjust resource limits
- Add environment variables
- Configure volumes

---

## Kubernetes/Helm

### Prerequisites

- Kubernetes cluster (1.24+)
- Helm 3.x
- kubectl configured

### Installation

1. **Add dependencies (if using PostgreSQL/MinIO from Bitnami):**

   ```bash
   helm repo add bitnami https://charts.bitnami.com/bitnami
   helm repo update
   ```

2. **Install with default values:**

   ```bash
   helm install datalake ./helm \
     --namespace datalake \
     --create-namespace
   ```

3. **Custom installation:**

   ```bash
   helm install datalake ./helm \
     --namespace datalake \
     --create-namespace \
     --set image.repository=your-registry/datalake \
     --set image.tag=1.0.0 \
     --set ingress.enabled=true \
     --set ingress.hosts[0].host=datalake.yourdomain.com
   ```

### Configuration Options

**Image:**

```bash
--set image.repository=your-registry/datalake
--set image.tag=1.0.0
--set image.pullPolicy=Always
```

**Resources:**

```bash
--set resources.requests.cpu=1000m
--set resources.requests.memory=1Gi
--set resources.limits.cpu=2000m
--set resources.limits.memory=2Gi
```

**Autoscaling:**

```bash
--set autoscaling.enabled=true
--set autoscaling.minReplicas=3
--set autoscaling.maxReplicas=10
--set autoscaling.targetCPUUtilizationPercentage=80
```

**Ingress:**

```bash
--set ingress.enabled=true
--set ingress.className=nginx
--set ingress.hosts[0].host=datalake.example.com
--set ingress.hosts[0].paths[0].path=/
--set ingress.hosts[0].paths[0].pathType=Prefix
```
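For anything beyond a handful of overrides, the same options can live in a values file instead of repeated `--set` flags. A minimal sketch, assuming the value names shown above; the file name `values-prod.yaml` is hypothetical:

```bash
# Hypothetical values file mirroring the --set examples above.
cat > values-prod.yaml <<'EOF'
image:
  repository: your-registry/datalake
  tag: "1.0.0"
  pullPolicy: Always

resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: datalake.example.com
      paths:
        - path: /
          pathType: Prefix
EOF

# Install (or upgrade) using the file instead of individual --set flags.
helm upgrade --install datalake ./helm \
  --namespace datalake \
  --create-namespace \
  -f values-prod.yaml
```

Values files are also easier to review and version per environment (e.g., `values-dev.yaml`, `values-prod.yaml`).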
### Upgrade

```bash
helm upgrade datalake ./helm \
  --namespace datalake \
  --set image.tag=1.1.0
```

### Uninstall

```bash
helm uninstall datalake --namespace datalake
```

---

## AWS Deployment

### Using AWS S3 Storage

1. **Create S3 bucket:**

   ```bash
   aws s3 mb s3://your-test-artifacts-bucket
   ```

2. **Create IAM user with S3 access:**

   ```bash
   aws iam create-user --user-name datalake-service
   # AmazonS3FullAccess is broad; for production, attach a policy
   # scoped to the artifact bucket instead (see Security Considerations).
   aws iam attach-user-policy --user-name datalake-service \
     --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
   ```

3. **Generate access keys:**

   ```bash
   aws iam create-access-key --user-name datalake-service
   ```

4. **Deploy with Helm:**

   ```bash
   helm install datalake ./helm \
     --namespace datalake \
     --create-namespace \
     --set config.storageBackend=s3 \
     --set aws.enabled=true \
     --set aws.accessKeyId=YOUR_ACCESS_KEY \
     --set aws.secretAccessKey=YOUR_SECRET_KEY \
     --set aws.region=us-east-1 \
     --set aws.bucketName=your-test-artifacts-bucket \
     --set minio.enabled=false
   ```

### Using EKS

1. **Create EKS cluster:**

   ```bash
   eksctl create cluster \
     --name datalake-cluster \
     --region us-east-1 \
     --nodegroup-name standard-workers \
     --node-type t3.medium \
     --nodes 3
   ```

2. **Configure kubectl:**

   ```bash
   aws eks update-kubeconfig --name datalake-cluster --region us-east-1
   ```

3. **Deploy application:**

   ```bash
   helm install datalake ./helm \
     --namespace datalake \
     --create-namespace \
     --set config.storageBackend=s3
   ```

### Using RDS for PostgreSQL

```bash
helm install datalake ./helm \
  --namespace datalake \
  --create-namespace \
  --set postgresql.enabled=false \
  --set config.databaseUrl="postgresql://user:pass@your-rds-endpoint:5432/datalake"
```

---

## Self-Hosted Deployment

### Using MinIO

1. **Deploy MinIO:**

   ```bash
   helm install minio bitnami/minio \
     --namespace datalake \
     --create-namespace \
     --set auth.rootUser=admin \
     --set auth.rootPassword=adminpassword \
     --set persistence.size=100Gi
   ```

2. **Deploy application** (`minio.enabled=false` disables the chart's bundled MinIO; the endpoint settings point at the standalone release from step 1):

   ```bash
   helm install datalake ./helm \
     --namespace datalake \
     --set config.storageBackend=minio \
     --set minio.enabled=false \
     --set minio.endpoint=minio:9000 \
     --set minio.accessKey=admin \
     --set minio.secretKey=adminpassword
   ```

### On-Premise Kubernetes

1. **Prepare persistent volumes:**

   ```yaml
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: datalake-postgres-pv
   spec:
     capacity:
       storage: 20Gi
     accessModes:
       - ReadWriteOnce
     hostPath:
       path: /data/postgres  # node-local storage; suitable for single-node or test clusters
   ```

2. **Deploy with local storage:**

   ```bash
   helm install datalake ./helm \
     --namespace datalake \
     --create-namespace \
     --set postgresql.persistence.storageClass=local-storage \
     --set minio.persistence.storageClass=local-storage
   ```

---

## GitLab CI/CD

### Setup

1. **Configure GitLab variables:**

   Go to Settings → CI/CD → Variables and add:

   | Variable | Description | Protected | Masked |
   |----------|-------------|-----------|--------|
   | `CI_REGISTRY_USER` | Docker registry username | No | No |
   | `CI_REGISTRY_PASSWORD` | Docker registry password | No | Yes |
   | `KUBE_CONFIG_DEV` | Base64 kubeconfig for dev | No | Yes |
   | `KUBE_CONFIG_STAGING` | Base64 kubeconfig for staging | Yes | Yes |
   | `KUBE_CONFIG_PROD` | Base64 kubeconfig for prod | Yes | Yes |

2. **Encode kubeconfig:**

   ```bash
   base64 -w 0 < ~/.kube/config  # On macOS: base64 -i ~/.kube/config
   ```
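The deploy jobs consume these variables by decoding the kubeconfig before calling Helm. The actual jobs live in `.gitlab-ci.yml`; this is only a sketch of the typical script shape, assuming the job image provides `helm` and `base64`, and using `$CI_COMMIT_SHORT_SHA` (a predefined GitLab variable) as the image tag:

```bash
# Hypothetical deploy-job script: decode the CI variable into a kubeconfig,
# then upgrade the release to the image built earlier in the pipeline.
echo "$KUBE_CONFIG_DEV" | base64 -d > kubeconfig
export KUBECONFIG="$PWD/kubeconfig"

helm upgrade --install datalake ./helm \
  --namespace datalake \
  --create-namespace \
  --set image.tag="$CI_COMMIT_SHORT_SHA"
```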
### Pipeline Stages

1. **Test**: Runs on all branches and MRs
2. **Build**: Builds Docker image on main/develop/tags
3. **Deploy**: Manual deployment to dev/staging/prod

### Deployment Flow

**Development:**

```bash
git push origin develop
# Manually trigger deploy:dev job in GitLab
```

**Staging:**

```bash
git push origin main
# Manually trigger deploy:staging job in GitLab
```

**Production:**

```bash
git tag v1.0.0
git push origin v1.0.0
# Manually trigger deploy:prod job in GitLab
```

### Customizing the Pipeline

Edit `.gitlab-ci.yml` to:

- Add more test stages
- Change deployment namespaces
- Adjust Helm values per environment
- Add security scanning
- Configure rollback procedures

---

## Monitoring

### Health Checks

```bash
# Kubernetes
kubectl get pods -n datalake
kubectl logs -f -n datalake deployment/datalake

# Direct
curl http://localhost:8000/health
```

### Metrics

Add Prometheus monitoring:

```bash
# Include your existing -f/--set values too; helm upgrade does not retain them by default.
helm upgrade --install datalake ./helm \
  --namespace datalake \
  --set metrics.enabled=true \
  --set serviceMonitor.enabled=true
```

---

## Backup and Recovery

### Database Backup

```bash
# PostgreSQL
kubectl exec -n datalake deployment/datalake-postgresql -- \
  pg_dump -U user datalake > backup.sql

# Restore
kubectl exec -i -n datalake deployment/datalake-postgresql -- \
  psql -U user datalake < backup.sql
```

A scheduled variant is sketched at the end of this guide.

### Storage Backup

**S3:**

```bash
aws s3 sync s3://your-bucket s3://backup-bucket
```

**MinIO:**

```bash
mc mirror minio/test-artifacts backup/test-artifacts
```

---

## Troubleshooting

### Pod Not Starting

```bash
kubectl describe pod <pod-name> -n datalake
kubectl logs <pod-name> -n datalake
```

### Database Connection Issues

```bash
# Run psql inside the container so $DATABASE_URL expands there, not in your local shell
kubectl exec -it -n datalake deployment/datalake -- \
  sh -c 'psql "$DATABASE_URL"'
```

### Storage Issues

```bash
# Check MinIO
kubectl port-forward -n datalake svc/minio 9000:9000
# Access http://localhost:9000
```

---

## Security Considerations

1. **Use secrets management:**
   - Kubernetes Secrets
   - AWS Secrets Manager
   - HashiCorp Vault

2. **Enable TLS:**
   - Configure ingress with TLS certificates
   - Use cert-manager for automatic certificates

3. **Network policies:**
   - Restrict pod-to-pod communication
   - Limit external access

4. **RBAC:**
   - Configure Kubernetes RBAC
   - Limit service account permissions

---

## Performance Tuning

### Database

- Increase connection pool size
- Add database indexes
- Configure autovacuum

### API

- Increase replica count
- Configure horizontal pod autoscaling (a standalone manifest is sketched at the end of this guide)
- Adjust resource requests/limits

### Storage

- Use CDN for frequently accessed files
- Configure S3 Transfer Acceleration
- Optimize MinIO deployment
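If the chart's `autoscaling.*` values don't expose everything you need (scaling behavior, additional metrics), an HPA can be managed directly. A minimal sketch, assuming the release creates a Deployment named `datalake`; disable the chart's own autoscaling first so two controllers don't fight over the replica count:

```bash
# Hypothetical standalone HPA for the datalake Deployment;
# mirrors the autoscaling values from the Helm section above.
cat <<'EOF' | kubectl apply -n datalake -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: datalake
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: datalake
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
EOF
```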
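For the scheduled backups referenced in Backup and Recovery, a Kubernetes CronJob can run `pg_dump` on a timer. A minimal sketch; the secret name `datalake-secrets` (assumed to carry `DATABASE_URL`) and the PVC `datalake-backups` are hypothetical and must match your deployment:

```bash
# Hypothetical nightly logical backup; adjust schedule, image, and storage to taste.
cat <<'EOF' | kubectl apply -n datalake -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: datalake-db-backup
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:15
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" > /backup/datalake-$(date +%F).sql
              envFrom:
                - secretRef:
                    name: datalake-secrets  # assumed to hold DATABASE_URL
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: datalake-backups  # assumed pre-provisioned PVC
EOF
```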