Add integrity verification documentation

Document how content-addressable storage and integrity verification work:
- SHA256 hashing and content-addressable storage overview
- Client-side verification steps (before upload, after download)
- Server-side consistency check endpoint and scheduling
- Recovery procedures for corrupted, missing, or orphaned artifacts
- CI/CD integration examples

@@ -23,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added consistency check endpoint tests with response format validation
- Added corruption detection tests: bit flip, truncation, appended content, size mismatch, missing S3 objects
- Added Digest header tests (RFC 3230) and verification mode tests
- Added integrity verification documentation (`docs/integrity-verification.md`)
- Added `package_versions` table for immutable version tracking separate from mutable tags (#56)
- Versions are set at upload time via explicit `version` parameter or auto-detected from filename/metadata
- Version detection priority: explicit parameter > package metadata > filename pattern

@@ -0,0 +1,294 @@
# Integrity Verification
Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.
## How It Works
### Content-Addressable Storage
Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:
1. **Automatic deduplication**: Identical content is stored only once
2. **Built-in integrity**: The artifact ID *is* the content hash
3. **Tamper detection**: Any modification changes the hash, making corruption detectable
When you upload a file:
1. Orchard computes the SHA256 hash of the content
2. The hash becomes the artifact ID (64-character hex string)
3. The file is stored in S3 at `fruits/{hash[0:2]}/{hash[2:4]}/{hash}`
4. The hash and metadata are recorded in the database
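As a quick illustration, here is a minimal sketch of deriving the artifact ID and storage key for a local file, following the layout described above (`myfile.tar.gz` is a placeholder):

```bash
# Sketch: compute the artifact ID and the S3 key Orchard would use,
# per the fruits/{hash[0:2]}/{hash[2:4]}/{hash} layout documented above.
hash=$(sha256sum myfile.tar.gz | cut -d' ' -f1)
echo "artifact id: ${hash}"
echo "s3 key:      fruits/${hash:0:2}/${hash:2:2}/${hash}"
```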
### Hash Format
- Algorithm: SHA256
- Format: 64-character lowercase hexadecimal string
- Example: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`
## Client-Side Verification
### Before Upload
Compute the hash locally before uploading so you can verify that the server received your content correctly:
```python
import hashlib

import requests

def compute_sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Compute the hash before upload
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Upload the file (base_url, project, and package are placeholders
# for your deployment)
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
)
result = response.json()

# Verify the server computed the same hash
assert result["artifact_id"] == local_hash, "Hash mismatch!"
```
### Providing Expected Hash on Upload
You can provide the expected hash in the upload request. The server will reject the upload if the computed hash doesn't match:
```python
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)

# Returns 422 if hash doesn't match
if response.status_code == 422:
    print("Checksum mismatch - upload rejected")
```
### After Download
Verify downloaded content matches the expected hash using response headers:
```python
response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy"},
)

# Get expected hash from header
expected_hash = response.headers.get("X-Checksum-SHA256")

# Compute hash of downloaded content
actual_hash = compute_sha256(response.content)

# Verify
if actual_hash != expected_hash:
    raise ValueError(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")
```
### Response Headers for Verification
Download responses include multiple headers for verification:
| Header | Format | Description |
|--------|--------|-------------|
| `X-Checksum-SHA256` | Hex string | SHA256 hash (64 chars) |
| `ETag` | `"<hash>"` | SHA256 hash in quotes |
| `Digest` | `sha-256=<base64>` | RFC 3230 format (base64-encoded) |
| `Content-Length` | Integer | File size in bytes |
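The `Digest` header can be checked the same way as `X-Checksum-SHA256`, with one extra decoding step. A hedged sketch, assuming `openssl` and `base64` are available and reusing the shell variables from the examples above:

```bash
# Sketch: verify the RFC 3230 Digest header, which carries the raw SHA256
# digest base64-encoded (unlike X-Checksum-SHA256, which is hex).
expected=$(curl -s -D - "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy" \
  -o artifact.tar.gz | grep -i '^digest:' | cut -d= -f2- | tr -d ' \r')
actual=$(openssl dgst -sha256 -binary artifact.tar.gz | base64)
[ "$actual" = "$expected" ] || { echo "Digest mismatch"; exit 1; }
```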
### Server-Side Verification on Download
Request server-side verification during download:
```bash
# Pre-verification: Server verifies before streaming (returns 500 if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"
# Stream verification: Server verifies while streaming (logs error if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"
```
The `X-Verified` header indicates whether server-side verification was performed:
- `X-Verified: true` - Content was verified by the server
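A short sketch of checking this header in a script (same placeholder variables as above; `curl -f` turns the 500 from a failed pre-verification into a non-zero exit):

```bash
# Sketch: download with server-side pre-verification, then confirm the
# X-Verified header before trusting the artifact.
headers=$(curl -sf -D - \
  "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre" \
  -o artifact.tar.gz)
echo "$headers" | grep -qi '^x-verified: true' || { echo "not server-verified"; exit 1; }
```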
## Server-Side Consistency Check
### Consistency Check Endpoint
Administrators can run a consistency check to verify all stored artifacts:
```bash
curl "${base_url}/api/v1/admin/consistency-check"
```
Response:
```json
{
  "total_artifacts_checked": 1234,
  "healthy": true,
  "orphaned_s3_objects": 0,
  "missing_s3_objects": 0,
  "size_mismatches": 0,
  "orphaned_s3_keys": [],
  "missing_s3_keys": [],
  "size_mismatch_artifacts": []
}
```
### What the Check Verifies
1. **Missing S3 objects**: Database records with no corresponding S3 object
2. **Orphaned S3 objects**: S3 objects with no database record
3. **Size mismatches**: S3 object size doesn't match database record
### Running Consistency Checks
**Manual check:**
```bash
# Check all artifacts
curl "${base_url}/api/v1/admin/consistency-check"
# Limit results (for large deployments)
curl "${base_url}/api/v1/admin/consistency-check?limit=100"
```
**Scheduled checks (recommended):**
Set up a cron job or Kubernetes CronJob to run periodic checks:
```yaml
# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orchard-consistency-check
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: check
              # The check script needs both curl and jq; curlimages/curl
              # does not ship jq, so install both on a plain Alpine base.
              image: alpine:3.20
              env:
                - name: ORCHARD_URL
                  value: "http://orchard.example.com"  # adjust for your deployment
              command:
                - /bin/sh
                - -c
                - |
                  apk add --no-cache curl jq
                  response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
                  healthy=$(echo "$response" | jq -r '.healthy')
                  if [ "$healthy" != "true" ]; then
                    echo "ALERT: Consistency check failed!"
                    echo "$response"
                    exit 1
                  fi
                  echo "Consistency check passed"
          restartPolicy: OnFailure
```
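If Kubernetes is not in the picture, a plain crontab entry can do the same job. A sketch, assuming `curl` and `jq` on the host and an `ORCHARD_URL` value you would adjust:

```bash
# Crontab sketch: daily check at 02:00; a non-zero exit (healthy != true)
# is surfaced by cron's mailer or whatever wrapper you use for alerting.
ORCHARD_URL=http://orchard.example.com
0 2 * * * curl -sf "$ORCHARD_URL/api/v1/admin/consistency-check" | jq -e '.healthy == true' > /dev/null
```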
## Recovery Procedures
### Corrupted Artifact (Size Mismatch)
If the consistency check reports size mismatches:
1. **Identify affected artifacts:**
```bash
curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
```
2. **Check if artifact can be re-uploaded:**
- If the original content is available, delete the corrupted artifact and re-upload
- The same content will produce the same artifact ID
3. **If original content is lost:**
- The artifact data is corrupted and cannot be recovered
- Delete the artifact record and notify affected users
- Consider restoring from backup if available
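If the original file is still at hand (step 2 above), a sketch of re-uploading and confirming that the content-addressed ID comes back unchanged (`jq` and the placeholder variables from earlier examples assumed):

```bash
# Sketch: re-upload the original content; identical bytes must yield
# the same artifact ID as the local SHA256.
expected=$(sha256sum myfile.tar.gz | cut -d' ' -f1)
artifact_id=$(curl -sf -F "file=@myfile.tar.gz" \
  "${base_url}/api/v1/project/${project}/${package}/upload" | jq -r '.artifact_id')
[ "$artifact_id" = "$expected" ] && echo "Re-upload OK: ${artifact_id}"
```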
### Missing S3 Object
If database records exist but S3 objects are missing:
1. **Identify affected artifacts:**
```bash
curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
```
2. **Check S3 bucket:**
- Verify the S3 bucket exists and is accessible
- Check S3 access logs for deletion events
- Check if objects were moved or lifecycle-deleted
3. **Recovery options:**
- Restore from S3 versioning (if enabled)
- Restore from backup
- Re-upload original content (if available)
- Delete orphaned database records
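For the first option, if versioning was enabled before the loss, a deleted object can be restored by removing its delete marker. A sketch with the AWS CLI (the bucket name is an assumption; the key comes from the consistency-check output):

```bash
# Sketch: find the delete marker for the missing key, then remove it
# to restore the object (requires S3 versioning to have been enabled).
aws s3api list-object-versions --bucket my-orchard-bucket \
  --prefix "fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
aws s3api delete-object --bucket my-orchard-bucket \
  --key "fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f" \
  --version-id "<delete-marker-version-id>"
```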
### Orphaned S3 Objects
If S3 objects exist without database records:
1. **Identify orphaned objects:**
```bash
curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
```
2. **Investigate cause:**
- Upload interrupted before database commit?
- Database record deleted but S3 cleanup failed?
3. **Resolution:**
- If content is needed, create database record manually
- If content is not needed, delete the S3 object to reclaim storage
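If the content is not needed, a sketch of reclaiming the storage (bucket name is an assumption; double-check the key against the consistency-check output before deleting):

```bash
# Sketch: pull one orphaned key from the report and delete the S3 object.
orphan_key=$(curl -s "${base_url}/api/v1/admin/consistency-check" | jq -r '.orphaned_s3_keys[0]')
aws s3 rm "s3://my-orchard-bucket/${orphan_key}"
```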
### Preventive Measures
1. **Enable S3 versioning** to recover from accidental deletions
2. **Regular backups** of both database and S3 bucket
3. **Scheduled consistency checks** to detect issues early
4. **Monitoring and alerting** on consistency check failures
5. **Audit logging** to track all artifact operations
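For the first measure, a sketch of enabling S3 versioning with the AWS CLI (bucket name is an assumption):

```bash
# Sketch: turn on versioning so deleted or overwritten objects stay recoverable.
aws s3api put-bucket-versioning \
  --bucket my-orchard-bucket \
  --versioning-configuration Status=Enabled
```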
## Verification in CI/CD
### Verifying Artifacts in Pipelines
```bash
#!/bin/bash
set -euo pipefail

# Download and verify an artifact in a CI pipeline
ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"

# Download; -D - writes the response headers to stdout while the body goes to the file
response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')

# Compute actual hash
actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)

# Verify
if [ "$actual_hash" != "$expected_hash" ]; then
  echo "ERROR: Integrity check failed!"
  echo "Expected: $expected_hash"
  echo "Actual:   $actual_hash"
  exit 1
fi
echo "Integrity verified: $actual_hash"
```
### Using Server-Side Verification
For critical deployments, use server-side pre-verification:
```bash
# Server verifies before streaming - returns 500 if corrupt
curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz
```
This ensures the artifact is verified before any bytes are streamed to your pipeline.