Integrity Verification

Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.

How It Works

Content-Addressable Storage

Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:

  1. Automatic deduplication: Identical content is stored only once
  2. Built-in integrity: The artifact ID is the content hash
  3. Tamper detection: Any modification changes the hash, making corruption detectable

When you upload a file:

  1. Orchard computes the SHA256 hash of the content
  2. The hash becomes the artifact ID (64-character hex string)
  3. The file is stored in S3 at fruits/{hash[0:2]}/{hash[2:4]}/{hash}
  4. The hash and metadata are recorded in the database
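
The mapping from hash to storage key can be reproduced in a few lines. This is an illustrative sketch of steps 1–3 above, not client code you need to write (the two-level fan-out presumably keeps any single S3 prefix small):

import hashlib

content = b"example artifact bytes"
artifact_id = hashlib.sha256(content).hexdigest()

# Two-level fan-out on the first four hex characters, as described in step 3
s3_key = f"fruits/{artifact_id[0:2]}/{artifact_id[2:4]}/{artifact_id}"
print(artifact_id)  # 64-character lowercase hex string
print(s3_key)       # fruits/<2 chars>/<2 chars>/<full hash>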

Hash Format

  • Algorithm: SHA256
  • Format: 64-character lowercase hexadecimal string
  • Example: dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f

Client-Side Verification

Before Upload

Compute the hash locally before uploading, then compare it against the hash the server returns to confirm your content arrived intact:

import hashlib

import requests

def compute_sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Compute hash before upload
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Upload the file
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
)
result = response.json()

# Verify server computed the same hash
assert result["artifact_id"] == local_hash, "Hash mismatch!"

Providing Expected Hash on Upload

You can provide the expected hash in the upload request. The server will reject the upload if the computed hash doesn't match:

response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)

# Returns 422 if hash doesn't match
if response.status_code == 422:
    print("Checksum mismatch - upload rejected")

After Download

Verify downloaded content matches the expected hash using response headers:

response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy"},
)

# Get expected hash from header
expected_hash = response.headers.get("X-Checksum-SHA256")

# Compute hash of downloaded content
actual_hash = compute_sha256(response.content)

# Verify
if expected_hash is None or actual_hash != expected_hash:
    raise ValueError(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")

Response Headers for Verification

Download responses include multiple headers for verification:

Header              Format            Description
------------------  ----------------  --------------------------------
X-Checksum-SHA256   Hex string        SHA256 hash (64 chars)
ETag                "<hash>"          SHA256 hash in quotes
Digest              sha-256=<base64>  RFC 3230 format (base64-encoded)
Content-Length      Integer           File size in bytes
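
Any of these headers can drive client-side verification. As a complement to the X-Checksum-SHA256 example above, here is a minimal sketch for the base64-encoded Digest form, reusing the response object from the download example:

import base64
import hashlib

digest = response.headers.get("Digest", "")
if digest.startswith("sha-256="):
    # The RFC 3230 value is the raw SHA256 digest, base64-encoded
    expected = base64.b64decode(digest[len("sha-256="):])
    if hashlib.sha256(response.content).digest() != expected:
        raise ValueError("Digest header does not match downloaded content")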

Server-Side Verification on Download

Request server-side verification during download:

# Pre-verification: Server verifies before streaming (returns 500 if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"

# Stream verification: Server verifies while streaming (logs error if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"

The X-Verified header indicates whether server-side verification was performed:

  • X-Verified: true - Content was verified by the server
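
A short sketch combining pre-verification with a check of this header, using the same placeholders as the earlier Python examples:

import requests

response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy", "verify": "true", "verify_mode": "pre"},
)
response.raise_for_status()  # pre-verification returns 500 if the content is corrupt

# Confirm the server actually performed verification
assert response.headers.get("X-Verified") == "true"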

Server-Side Consistency Check

Consistency Check Endpoint

Administrators can run a consistency check to verify all stored artifacts:

curl "${base_url}/api/v1/admin/consistency-check"

Response:

{
  "total_artifacts_checked": 1234,
  "healthy": true,
  "orphaned_s3_objects": 0,
  "missing_s3_objects": 0,
  "size_mismatches": 0,
  "orphaned_s3_keys": [],
  "missing_s3_keys": [],
  "size_mismatch_artifacts": []
}

What the Check Verifies

  1. Missing S3 objects: Database records with no corresponding S3 object
  2. Orphaned S3 objects: S3 objects with no database record
  3. Size mismatches: S3 object size doesn't match database record
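
A minimal sketch that calls the endpoint and reports each class of problem, based on the response fields shown above (how you authenticate as an administrator is deployment-specific and omitted here):

import requests

report = requests.get(f"{base_url}/api/v1/admin/consistency-check").json()

if not report["healthy"]:
    print(f"{report['missing_s3_objects']} missing S3 objects:", report["missing_s3_keys"])
    print(f"{report['orphaned_s3_objects']} orphaned S3 objects:", report["orphaned_s3_keys"])
    print(f"{report['size_mismatches']} size mismatches:", report["size_mismatch_artifacts"])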

Running Consistency Checks

Manual check:

# Check all artifacts
curl "${base_url}/api/v1/admin/consistency-check"

# Limit results (for large deployments)
curl "${base_url}/api/v1/admin/consistency-check?limit=100"

Scheduled checks (recommended):

Set up a cron job or Kubernetes CronJob to run periodic checks:

# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orchard-consistency-check
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: check
            image: curlimages/curl
            env:
            - name: ORCHARD_URL
              value: "http://orchard.example.com"  # adjust for your deployment
            command:
            - /bin/sh
            - -c
            - |
              response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
              # The curl image does not ship jq, so test the healthy flag with grep
              if ! echo "$response" | grep -q '"healthy": *true'; then
                echo "ALERT: Consistency check failed!"
                echo "$response"
                exit 1
              fi
              echo "Consistency check passed"
          restartPolicy: OnFailure

Recovery Procedures

Corrupted Artifact (Size Mismatch)

If the consistency check reports size mismatches:

  1. Identify affected artifacts:

    curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
    
  2. Check if artifact can be re-uploaded:

    • If the original content is available, delete the corrupted artifact and re-upload
    • The same content will produce the same artifact ID (see the sketch after this list)
  3. If original content is lost:

    • The artifact data is corrupted and cannot be recovered
    • Delete the artifact record and notify affected users
    • Consider restoring from backup if available
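
For example, if the original file is still available, re-uploading it with the expected-hash header (see "Providing Expected Hash on Upload") restores the artifact under the same ID. A sketch with the same placeholders as the earlier examples:

import hashlib
import requests

with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = hashlib.sha256(content).hexdigest()

# Content-addressing guarantees the re-upload lands on the same artifact ID
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)
assert response.json()["artifact_id"] == local_hash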

Missing S3 Object

If database records exist but S3 objects are missing:

  1. Identify affected artifacts:

    curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
    
  2. Check S3 bucket:

    • Verify the S3 bucket exists and is accessible
    • Check S3 access logs for deletion events
    • Check if objects were moved or lifecycle-deleted
  3. Recovery options:

    • Restore from S3 versioning, if enabled (see the sketch after this list)
    • Restore from backup
    • Re-upload original content (if available)
    • Delete orphaned database records
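
As a sketch of the versioning route (this assumes versioning was enabled on the bucket before the loss; the bucket name is hypothetical, and boto3 is used for illustration):

import boto3

s3 = boto3.client("s3")
bucket = "orchard-artifacts"  # hypothetical -- use your deployment's bucket
key = "fruits/..."            # one of the keys reported in missing_s3_keys

# If the object was removed via a delete marker, deleting the marker restores it
listing = s3.list_object_versions(Bucket=bucket, Prefix=key)
for marker in listing.get("DeleteMarkers", []):
    if marker["Key"] == key and marker["IsLatest"]:
        s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])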

Orphaned S3 Objects

If S3 objects exist without database records:

  1. Identify orphaned objects:

    curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
    
  2. Investigate cause:

    • Upload interrupted before database commit?
    • Database record deleted but S3 cleanup failed?
  3. Resolution:

    • If content is needed, create database record manually
    • If content is not needed, delete the S3 object to reclaim storage
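
A cleanup sketch for the second case, to be run only after confirming the content is not needed (the bucket name is hypothetical; boto3 is used for illustration):

import boto3
import requests

report = requests.get(f"{base_url}/api/v1/admin/consistency-check").json()
s3 = boto3.client("s3")

for key in report["orphaned_s3_keys"]:
    s3.delete_object(Bucket="orchard-artifacts", Key=key)  # hypothetical bucket
    print(f"Deleted orphaned object: {key}")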

Preventive Measures

  1. Enable S3 versioning to recover from accidental deletions
  2. Regular backups of both database and S3 bucket
  3. Scheduled consistency checks to detect issues early
  4. Monitoring and alerting on consistency check failures
  5. Audit logging to track all artifact operations

Verification in CI/CD

Verifying Artifacts in Pipelines

#!/bin/bash
# Download and verify artifact in CI pipeline

ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"

# Download with verification headers
response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')

# Compute actual hash
actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)

# Verify
if [ "$actual_hash" != "$expected_hash" ]; then
    echo "ERROR: Integrity check failed!"
    echo "Expected: $expected_hash"
    echo "Actual:   $actual_hash"
    exit 1
fi

echo "Integrity verified: $actual_hash"

Using Server-Side Verification

For critical deployments, use server-side pre-verification:

# Server verifies before streaming - returns 500 if corrupt
curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz

This ensures the artifact is verified before any bytes are streamed to your pipeline.