Integrity Verification

Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.

How It Works

Content-Addressable Storage

Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:

  1. Automatic deduplication: Identical content is stored only once
  2. Built-in integrity: The artifact ID is the content hash
  3. Tamper detection: Any modification changes the hash, making corruption detectable

When you upload a file:

  1. Orchard computes the SHA256 hash of the content
  2. The hash becomes the artifact ID (64-character hex string)
  3. The file is stored in S3 at fruits/{hash[0:2]}/{hash[2:4]}/{hash}
  4. The hash and metadata are recorded in the database
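
The mapping from hash to storage key can be reproduced in a few lines. This is an illustrative sketch of steps 1–3 above, not client code you need to write (the two-level fan-out presumably keeps any single S3 prefix small):

import hashlib

content = b"example artifact bytes"
artifact_id = hashlib.sha256(content).hexdigest()

# Two-level fan-out on the first four hex characters, as described in step 3
s3_key = f"fruits/{artifact_id[0:2]}/{artifact_id[2:4]}/{artifact_id}"
print(artifact_id)  # 64-character lowercase hex string
print(s3_key)       # fruits/<2 chars>/<2 chars>/<full hash>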

Hash Format

  • Algorithm: SHA256
  • Format: 64-character lowercase hexadecimal string
  • Example: dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f

Client-Side Verification

Before Upload

Compute the hash locally before uploading, then compare it against the hash the server returns to confirm your content arrived intact:

import hashlib

import requests

def compute_sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Compute hash before upload
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Upload the file
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
)
result = response.json()

# Verify server computed the same hash
assert result["artifact_id"] == local_hash, "Hash mismatch!"

Providing Expected Hash on Upload

You can provide the expected hash in the upload request. The server will reject the upload if the computed hash doesn't match:

response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)

# Returns 422 if hash doesn't match
if response.status_code == 422:
    print("Checksum mismatch - upload rejected")

After Download

Verify downloaded content matches the expected hash using response headers:

response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy"},
)

# Get expected hash from header
expected_hash = response.headers.get("X-Checksum-SHA256")

# Compute hash of downloaded content
actual_hash = compute_sha256(response.content)

# Verify
if expected_hash is None or actual_hash != expected_hash:
    raise ValueError(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")

Response Headers for Verification

Download responses include multiple headers for verification:

Header              Format            Description
------------------  ----------------  --------------------------------
X-Checksum-SHA256   Hex string        SHA256 hash (64 chars)
ETag                "<hash>"          SHA256 hash in quotes
Digest              sha-256=<base64>  RFC 3230 format (base64-encoded)
Content-Length      Integer           File size in bytes
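
Any of these headers can drive client-side verification. As a complement to the X-Checksum-SHA256 example above, here is a minimal sketch for the base64-encoded Digest form, reusing the response object from the download example:

import base64
import hashlib

digest = response.headers.get("Digest", "")
if digest.startswith("sha-256="):
    # The RFC 3230 value is the raw SHA256 digest, base64-encoded
    expected = base64.b64decode(digest[len("sha-256="):])
    if hashlib.sha256(response.content).digest() != expected:
        raise ValueError("Digest header does not match downloaded content")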

Server-Side Verification on Download

Request server-side verification during download:

# Pre-verification: Server verifies before streaming (returns 500 if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"

# Stream verification: Server verifies while streaming (logs error if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"

The X-Verified header indicates whether server-side verification was performed:

  • X-Verified: true - Content was verified by the server
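
A short sketch combining pre-verification with a check of this header, using the same placeholders as the earlier Python examples:

import requests

response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy", "verify": "true", "verify_mode": "pre"},
)
response.raise_for_status()  # pre-verification returns 500 if the content is corrupt

# Confirm the server actually performed verification
assert response.headers.get("X-Verified") == "true"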

Server-Side Consistency Check

Consistency Check Endpoint

Administrators can run a consistency check to verify all stored artifacts:

curl "${base_url}/api/v1/admin/consistency-check"

Response:

{
  "total_artifacts_checked": 1234,
  "healthy": true,
  "orphaned_s3_objects": 0,
  "missing_s3_objects": 0,
  "size_mismatches": 0,
  "orphaned_s3_keys": [],
  "missing_s3_keys": [],
  "size_mismatch_artifacts": []
}

What the Check Verifies

  1. Missing S3 objects: Database records with no corresponding S3 object
  2. Orphaned S3 objects: S3 objects with no database record
  3. Size mismatches: S3 object size doesn't match database record
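
A minimal sketch that calls the endpoint and reports each class of problem, based on the response fields shown above (how you authenticate as an administrator is deployment-specific and omitted here):

import requests

report = requests.get(f"{base_url}/api/v1/admin/consistency-check").json()

if not report["healthy"]:
    print(f"{report['missing_s3_objects']} missing S3 objects:", report["missing_s3_keys"])
    print(f"{report['orphaned_s3_objects']} orphaned S3 objects:", report["orphaned_s3_keys"])
    print(f"{report['size_mismatches']} size mismatches:", report["size_mismatch_artifacts"])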

Running Consistency Checks

Manual check:

# Check all artifacts
curl "${base_url}/api/v1/admin/consistency-check"

# Limit results (for large deployments)
curl "${base_url}/api/v1/admin/consistency-check?limit=100"

Scheduled checks (recommended):

Set up a cron job or Kubernetes CronJob to run periodic checks:

# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orchard-consistency-check
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: check
            image: curlimages/curl
            env:
            - name: ORCHARD_URL
              value: "http://orchard.example.com"  # adjust for your deployment
            command:
            - /bin/sh
            - -c
            - |
              response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
              # The curl image does not ship jq, so test the healthy flag with grep
              if ! echo "$response" | grep -q '"healthy": *true'; then
                echo "ALERT: Consistency check failed!"
                echo "$response"
                exit 1
              fi
              echo "Consistency check passed"
          restartPolicy: OnFailure

Recovery Procedures

Corrupted Artifact (Size Mismatch)

If the consistency check reports size mismatches:

  1. Identify affected artifacts:

    curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
    
  2. Check if artifact can be re-uploaded:

    • If the original content is available, delete the corrupted artifact and re-upload
    • The same content will produce the same artifact ID (see the sketch after this list)
  3. If original content is lost:

    • The artifact data is corrupted and cannot be recovered
    • Delete the artifact record and notify affected users
    • Consider restoring from backup if available
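
For example, if the original file is still available, re-uploading it with the expected-hash header (see "Providing Expected Hash on Upload") restores the artifact under the same ID. A sketch with the same placeholders as the earlier examples:

import hashlib
import requests

with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = hashlib.sha256(content).hexdigest()

# Content-addressing guarantees the re-upload lands on the same artifact ID
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)
assert response.json()["artifact_id"] == local_hash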

Missing S3 Object

If database records exist but S3 objects are missing:

  1. Identify affected artifacts:

    curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
    
  2. Check S3 bucket:

    • Verify the S3 bucket exists and is accessible
    • Check S3 access logs for deletion events
    • Check if objects were moved or lifecycle-deleted
  3. Recovery options:

    • Restore from S3 versioning, if enabled (see the sketch after this list)
    • Restore from backup
    • Re-upload original content (if available)
    • Delete orphaned database records
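
As a sketch of the versioning route (this assumes versioning was enabled on the bucket before the loss; the bucket name is hypothetical, and boto3 is used for illustration):

import boto3

s3 = boto3.client("s3")
bucket = "orchard-artifacts"  # hypothetical -- use your deployment's bucket
key = "fruits/..."            # one of the keys reported in missing_s3_keys

# If the object was removed via a delete marker, deleting the marker restores it
listing = s3.list_object_versions(Bucket=bucket, Prefix=key)
for marker in listing.get("DeleteMarkers", []):
    if marker["Key"] == key and marker["IsLatest"]:
        s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])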

Orphaned S3 Objects

If S3 objects exist without database records:

  1. Identify orphaned objects:

    curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
    
  2. Investigate cause:

    • Upload interrupted before database commit?
    • Database record deleted but S3 cleanup failed?
  3. Resolution:

    • If content is needed, create database record manually
    • If content is not needed, delete the S3 object to reclaim storage
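
A cleanup sketch for the second case, to be run only after confirming the content is not needed (the bucket name is hypothetical; boto3 is used for illustration):

import boto3
import requests

report = requests.get(f"{base_url}/api/v1/admin/consistency-check").json()
s3 = boto3.client("s3")

for key in report["orphaned_s3_keys"]:
    s3.delete_object(Bucket="orchard-artifacts", Key=key)  # hypothetical bucket
    print(f"Deleted orphaned object: {key}")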

Preventive Measures

  1. Enable S3 versioning to recover from accidental deletions
  2. Regular backups of both database and S3 bucket
  3. Scheduled consistency checks to detect issues early
  4. Monitoring and alerting on consistency check failures
  5. Audit logging to track all artifact operations

Verification in CI/CD

Verifying Artifacts in Pipelines

#!/bin/bash
# Download and verify artifact in CI pipeline

ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"

# Download with verification headers
response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')

# Compute actual hash
actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)

# Verify
if [ "$actual_hash" != "$expected_hash" ]; then
    echo "ERROR: Integrity check failed!"
    echo "Expected: $expected_hash"
    echo "Actual:   $actual_hash"
    exit 1
fi

echo "Integrity verified: $actual_hash"

Using Server-Side Verification

For critical deployments, use server-side pre-verification:

# Server verifies before streaming - returns 500 if corrupt
curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz

This ensures the artifact is verified before any bytes are streamed to your pipeline.