diff --git a/CHANGELOG.md b/CHANGELOG.md
index 50785ce..9017157 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -23,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added consistency check endpoint tests with response format validation
 - Added corruption detection tests: bit flip, truncation, appended content, size mismatch, missing S3 objects
 - Added Digest header tests (RFC 3230) and verification mode tests
+- Added integrity verification documentation (`docs/integrity-verification.md`)
 - Added `package_versions` table for immutable version tracking separate from mutable tags (#56)
 - Versions are set at upload time via explicit `version` parameter or auto-detected from filename/metadata
 - Version detection priority: explicit parameter > package metadata > filename pattern
diff --git a/docs/integrity-verification.md b/docs/integrity-verification.md
new file mode 100644
index 0000000..ced5b3d
--- /dev/null
+++ b/docs/integrity-verification.md
@@ -0,0 +1,294 @@
+# Integrity Verification
+
+Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.
+
+## How It Works
+
+### Content-Addressable Storage
+
+Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:
+
+1. **Automatic deduplication**: Identical content is stored only once
+2. **Built-in integrity**: The artifact ID *is* the content hash
+3. **Tamper detection**: Any modification changes the hash, making corruption detectable
+
+When you upload a file:
+1. Orchard computes the SHA256 hash of the content
+2. The hash becomes the artifact ID (64-character hex string)
+3. The file is stored in S3 at `fruits/{hash[0:2]}/{hash[2:4]}/{hash}`
+4. The hash and metadata are recorded in the database
+
+### Hash Format
+
+- Algorithm: SHA256
+- Format: 64-character lowercase hexadecimal string
+- Example: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`
+
+## Client-Side Verification
+
+### Before Upload
+
+Compute the hash locally before uploading so you can confirm that the server received your content intact:
+
+```python
+import hashlib
+
+import requests
+
+def compute_sha256(content: bytes) -> str:
+    return hashlib.sha256(content).hexdigest()
+
+# Compute hash before upload
+with open("myfile.tar.gz", "rb") as f:
+    content = f.read()
+local_hash = compute_sha256(content)
+
+# Upload the file
+response = requests.post(
+    f"{base_url}/api/v1/project/{project}/{package}/upload",
+    files={"file": ("myfile.tar.gz", content)},
+)
+result = response.json()
+
+# Verify the server computed the same hash
+assert result["artifact_id"] == local_hash, "Hash mismatch!"
+```
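+For large artifacts, you can hash the file in chunks instead of reading it into memory all at once. This is a minimal sketch using only the standard-library `hashlib`; it does not depend on any Orchard API:
+
+```python
+import hashlib
+
+def compute_sha256_streaming(path: str, chunk_size: int = 1 << 20) -> str:
+    """Hash a file in 1 MiB chunks to keep memory usage flat."""
+    digest = hashlib.sha256()
+    with open(path, "rb") as f:
+        for chunk in iter(lambda: f.read(chunk_size), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+```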
+### Providing Expected Hash on Upload
+
+You can provide the expected hash in the upload request. The server will reject the upload if the computed hash doesn't match:
+
+```python
+response = requests.post(
+    f"{base_url}/api/v1/project/{project}/{package}/upload",
+    files={"file": ("myfile.tar.gz", content)},
+    headers={"X-Checksum-SHA256": local_hash},
+)
+
+# Returns 422 if the hash doesn't match
+if response.status_code == 422:
+    print("Checksum mismatch - upload rejected")
+```
+
+### After Download
+
+Verify that downloaded content matches the expected hash using response headers:
+
+```python
+response = requests.get(
+    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
+    params={"mode": "proxy"},
+)
+
+# Get expected hash from header
+expected_hash = response.headers.get("X-Checksum-SHA256")
+
+# Compute hash of downloaded content
+actual_hash = compute_sha256(response.content)
+
+# Verify
+if actual_hash != expected_hash:
+    raise Exception(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")
+```
+
+### Response Headers for Verification
+
+Download responses include multiple headers for verification:
+
+| Header | Format | Description |
+|--------|--------|-------------|
+| `X-Checksum-SHA256` | Hex string | SHA256 hash (64 chars) |
+| `ETag` | `"<sha256>"` | SHA256 hash in double quotes |
+| `Digest` | `sha-256=<base64>` | RFC 3230 format (base64-encoded digest) |
+| `Content-Length` | Integer | File size in bytes |
+
+### Server-Side Verification on Download
+
+Request server-side verification during download:
+
+```bash
+# Pre-verification: Server verifies before streaming (returns 500 if corrupt)
+curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"
+
+# Stream verification: Server verifies while streaming (logs error if corrupt)
+curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"
+```
+
+The `X-Verified` header indicates whether server-side verification was performed:
+- `X-Verified: true` - Content was verified by the server
+
+## Server-Side Consistency Check
+
+### Consistency Check Endpoint
+
+Administrators can run a consistency check to verify all stored artifacts:
+
+```bash
+curl "${base_url}/api/v1/admin/consistency-check"
+```
+
+Response:
+```json
+{
+  "total_artifacts_checked": 1234,
+  "healthy": true,
+  "orphaned_s3_objects": 0,
+  "missing_s3_objects": 0,
+  "size_mismatches": 0,
+  "orphaned_s3_keys": [],
+  "missing_s3_keys": [],
+  "size_mismatch_artifacts": []
+}
+```
+
+### What the Check Verifies
+
+1. **Missing S3 objects**: Database records with no corresponding S3 object
+2. **Orphaned S3 objects**: S3 objects with no database record
+3. **Size mismatches**: S3 object size doesn't match the database record
+
+### Running Consistency Checks
+
+**Manual check:**
+```bash
+# Check all artifacts
+curl "${base_url}/api/v1/admin/consistency-check"
+
+# Limit results (for large deployments)
+curl "${base_url}/api/v1/admin/consistency-check?limit=100"
+```
+
+**Scheduled checks (recommended):**
+
+Set up a cron job or Kubernetes CronJob to run periodic checks:
+
+```yaml
+# Kubernetes CronJob example
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: orchard-consistency-check
+spec:
+  schedule: "0 2 * * *"  # Daily at 2 AM
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          containers:
+            - name: check
+              image: curlimages/curl
+              command:
+                - /bin/sh
+                - -c
+                - |
+                  response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
+                  # curlimages/curl does not ship jq, so grep the health flag instead
+                  if ! echo "$response" | grep -q '"healthy": *true'; then
+                    echo "ALERT: Consistency check failed!"
+                    echo "$response"
+                    exit 1
+                  fi
+                  echo "Consistency check passed"
+          restartPolicy: OnFailure
+```
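+If you prefer to drive the check from a script rather than from the shell, the sketch below does the same thing with `requests`. It assumes only the endpoint and response fields documented above:
+
+```python
+import requests
+
+def run_consistency_check(base_url: str) -> None:
+    """Query the admin consistency check and raise if the store is unhealthy."""
+    response = requests.get(f"{base_url}/api/v1/admin/consistency-check")
+    response.raise_for_status()
+    report = response.json()
+    if not report["healthy"]:
+        raise RuntimeError(
+            f"Consistency check failed: "
+            f"{report['missing_s3_objects']} missing, "
+            f"{report['orphaned_s3_objects']} orphaned, "
+            f"{report['size_mismatches']} size mismatches"
+        )
+    print(f"OK: {report['total_artifacts_checked']} artifacts checked")
+```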
+## Recovery Procedures
+
+### Corrupted Artifact (Size Mismatch)
+
+If the consistency check reports size mismatches:
+
+1. **Identify affected artifacts:**
+   ```bash
+   curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
+   ```
+
+2. **Check whether the artifact can be re-uploaded:**
+   - If the original content is available, delete the corrupted artifact and re-upload
+   - The same content will produce the same artifact ID
+
+3. **If the original content is lost:**
+   - The artifact data is corrupted and cannot be recovered
+   - Delete the artifact record and notify affected users
+   - Consider restoring from backup if available
+
+### Missing S3 Object
+
+If database records exist but S3 objects are missing:
+
+1. **Identify affected artifacts:**
+   ```bash
+   curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
+   ```
+
+2. **Check the S3 bucket:**
+   - Verify the S3 bucket exists and is accessible
+   - Check S3 access logs for deletion events
+   - Check whether objects were moved or lifecycle-deleted
+
+3. **Recovery options:**
+   - Restore from S3 versioning (if enabled)
+   - Restore from backup
+   - Re-upload the original content (if available)
+   - Delete orphaned database records
+
+### Orphaned S3 Objects
+
+If S3 objects exist without database records:
+
+1. **Identify orphaned objects:**
+   ```bash
+   curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
+   ```
+
+2. **Investigate the cause:**
+   - Upload interrupted before the database commit?
+   - Database record deleted but S3 cleanup failed?
+
+3. **Resolution:**
+   - If the content is needed, create the database record manually
+   - If the content is not needed, delete the S3 object to reclaim storage
+
+### Preventive Measures
+
+1. **Enable S3 versioning** to recover from accidental deletions
+2. **Regular backups** of both the database and the S3 bucket
+3. **Scheduled consistency checks** to detect issues early
+4. **Monitoring and alerting** on consistency check failures
+5. **Audit logging** to track all artifact operations
+
+## Verification in CI/CD
+
+### Verifying Artifacts in Pipelines
+
+```bash
+#!/bin/bash
+# Download and verify artifact in CI pipeline
+
+ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"
+
+# Download, capturing response headers for verification
+response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
+expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')
+
+# Compute actual hash
+actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)
+
+# Verify
+if [ "$actual_hash" != "$expected_hash" ]; then
+  echo "ERROR: Integrity check failed!"
+  echo "Expected: $expected_hash"
+  echo "Actual: $actual_hash"
+  exit 1
+fi
+
+echo "Integrity verified: $actual_hash"
+```
+
+### Using Server-Side Verification
+
+For critical deployments, use server-side pre-verification:
+
+```bash
+# Server verifies before streaming - returns 500 if corrupt
+curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz
+```
+
+This ensures the artifact is verified before any bytes are streamed to your pipeline.
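+
+If your pipeline is scripted in Python rather than shell, the same flow looks like the sketch below. It combines server-side pre-verification with a local re-check, using only the query parameters and headers documented above:
+
+```python
+import hashlib
+
+import requests
+
+def download_verified(artifact_url: str, dest: str) -> None:
+    """Download with server-side pre-verification, then re-verify locally."""
+    response = requests.get(
+        artifact_url,
+        params={"mode": "proxy", "verify": "true", "verify_mode": "pre"},
+    )
+    response.raise_for_status()  # the server returns 500 if the artifact is corrupt
+
+    # Confirm that server-side verification actually ran
+    if response.headers.get("X-Verified") != "true":
+        raise RuntimeError("Server did not verify the artifact")
+
+    # Defense in depth: re-verify the bytes we received
+    expected = response.headers["X-Checksum-SHA256"]
+    actual = hashlib.sha256(response.content).hexdigest()
+    if actual != expected:
+        raise RuntimeError(f"Integrity check failed: expected {expected}, got {actual}")
+
+    with open(dest, "wb") as f:
+        f.write(response.content)
+```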