# Integrity Verification
Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.
## How It Works

### Content-Addressable Storage
Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:
- **Automatic deduplication**: Identical content is stored only once
- **Built-in integrity**: The artifact ID is the content hash
- **Tamper detection**: Any modification changes the hash, making corruption detectable
When you upload a file:

1. Orchard computes the SHA256 hash of the content
2. The hash becomes the artifact ID (a 64-character hex string)
3. The file is stored in S3 at `fruits/{hash[0:2]}/{hash[2:4]}/{hash}`
4. The hash and metadata are recorded in the database
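Because the storage key is a pure function of the content, you can predict where an artifact will land before uploading it. A minimal sketch of the key derivation (the `fruits/` layout is taken from above; the helper name is illustrative):

```python
import hashlib

def s3_key_for(content: bytes) -> str:
    """Derive the content-addressed S3 key for an artifact."""
    digest = hashlib.sha256(content).hexdigest()  # 64-char lowercase hex artifact ID
    return f"fruits/{digest[0:2]}/{digest[2:4]}/{digest}"

# Identical content always maps to the same key, which is what gives
# Orchard deduplication and tamper detection for free.
print(s3_key_for(b"hello world"))
# fruits/b9/4d/b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
```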
### Hash Format

- **Algorithm**: SHA256
- **Format**: 64-character lowercase hexadecimal string
- **Example**: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`
## Client-Side Verification

### Before Upload

Compute the hash locally before uploading, then compare it with the hash the server reports to confirm your content arrived intact:
```python
import hashlib

import requests

def compute_sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Compute the hash before upload
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Upload the file
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
)
result = response.json()

# Verify the server computed the same hash
assert result["artifact_id"] == local_hash, "Hash mismatch!"
```
### Providing an Expected Hash on Upload

You can include the expected hash in the upload request; the server rejects the upload if its computed hash doesn't match:
```python
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)

# Returns 422 if the hash doesn't match
if response.status_code == 422:
    print("Checksum mismatch - upload rejected")
```
### After Download

Verify that downloaded content matches the expected hash from the response headers:
```python
response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy"},
)

# Get the expected hash from the response header
expected_hash = response.headers.get("X-Checksum-SHA256")

# Compute the hash of the downloaded content
actual_hash = compute_sha256(response.content)

# Verify
if actual_hash != expected_hash:
    raise Exception(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")
```
### Response Headers for Verification
Download responses include multiple headers for verification:
| Header | Format | Description |
|---|---|---|
| `X-Checksum-SHA256` | Hex string | SHA256 hash (64 chars) |
| `ETag` | `"<hash>"` | SHA256 hash in quotes |
| `Digest` | `sha-256=<base64>` | RFC 3230 format (base64-encoded) |
| `Content-Length` | Integer | File size in bytes |
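Note that `Digest` is base64-encoded while `X-Checksum-SHA256` is hex, so cross-checking the two requires a decode step. A minimal sketch (the helper name is illustrative):

```python
import base64
import binascii

def digest_matches(digest_header: str, hex_hash: str) -> bool:
    """Compare an RFC 3230 'sha-256=<base64>' value against a hex SHA256 hash."""
    prefix = "sha-256="
    if not digest_header.startswith(prefix):
        return False
    decoded = base64.b64decode(digest_header[len(prefix):])
    return decoded == binascii.unhexlify(hex_hash)
```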
## Server-Side Verification on Download
Request server-side verification during download:
```bash
# Pre-verification: the server verifies before streaming (returns 500 if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"

# Stream verification: the server verifies while streaming (logs an error if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"
```
The `X-Verified` header indicates whether server-side verification was performed:

- `X-Verified: true`: the content was verified by the server
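The same request from Python, reusing the variables from the client-side examples above:

```python
# Ask the server to verify the artifact before streaming it
response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy", "verify": "true", "verify_mode": "pre"},
)
response.raise_for_status()  # a 500 here means the server detected corruption

if response.headers.get("X-Verified") == "true":
    print("Content verified server-side before streaming")
```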
## Server-Side Consistency Check

### Consistency Check Endpoint
Administrators can run a consistency check to verify all stored artifacts:
```bash
curl "${base_url}/api/v1/admin/consistency-check"
```
Response:
```json
{
  "total_artifacts_checked": 1234,
  "healthy": true,
  "orphaned_s3_objects": 0,
  "missing_s3_objects": 0,
  "size_mismatches": 0,
  "orphaned_s3_keys": [],
  "missing_s3_keys": [],
  "size_mismatch_artifacts": []
}
```
### What the Check Verifies

- **Missing S3 objects**: Database records with no corresponding S3 object
- **Orphaned S3 objects**: S3 objects with no database record
- **Size mismatches**: The S3 object size doesn't match the database record
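Conceptually, the check is a set comparison between database records and an S3 listing. A simplified sketch of the logic (not Orchard's actual implementation; both arguments map storage key to size in bytes):

```python
def consistency_report(db_artifacts: dict[str, int], s3_objects: dict[str, int]) -> dict:
    """Compare database records against S3 objects, keyed by storage key."""
    missing = sorted(set(db_artifacts) - set(s3_objects))   # in DB, not in S3
    orphaned = sorted(set(s3_objects) - set(db_artifacts))  # in S3, not in DB
    mismatched = sorted(
        key for key in set(db_artifacts) & set(s3_objects)
        if db_artifacts[key] != s3_objects[key]             # sizes disagree
    )
    return {
        "healthy": not (missing or orphaned or mismatched),
        "missing_s3_keys": missing,
        "orphaned_s3_keys": orphaned,
        "size_mismatch_artifacts": mismatched,
    }
```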
### Running Consistency Checks

**Manual check:**
```bash
# Check all artifacts
curl "${base_url}/api/v1/admin/consistency-check"

# Limit results (for large deployments)
curl "${base_url}/api/v1/admin/consistency-check?limit=100"
```
**Scheduled checks (recommended):**
Set up a cron job or Kubernetes CronJob to run periodic checks:
```yaml
# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orchard-consistency-check
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: check
              image: curlimages/curl
              env:
                - name: ORCHARD_URL
                  value: "http://orchard:8080" # assumption: point at your Orchard service
              command:
                - /bin/sh
                - -c
                - |
                  response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
                  # curlimages/curl does not ship jq, so test the JSON field with grep
                  if ! echo "$response" | grep -q '"healthy": *true'; then
                    echo "ALERT: Consistency check failed!"
                    echo "$response"
                    exit 1
                  fi
                  echo "Consistency check passed"
          restartPolicy: OnFailure
```
## Recovery Procedures

### Corrupted Artifact (Size Mismatch)
If the consistency check reports size mismatches:
1. **Identify affected artifacts:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
   ```

2. **Check whether the artifact can be re-uploaded:**

   - If the original content is available, delete the corrupted artifact and re-upload (see the sketch after this list)
   - The same content will produce the same artifact ID

3. **If the original content is lost:**

   - The artifact data is corrupted and cannot be recovered
   - Delete the artifact record and notify affected users
   - Consider restoring from backup if available
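Because the storage is content-addressed, re-uploading the original file recreates the artifact under the same ID. A sketch reusing the upload endpoint and the `compute_sha256` helper from earlier:

```python
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Re-upload; the checksum header makes the server reject any corruption in transit
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)
assert response.json()["artifact_id"] == local_hash  # same content, same ID
```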
### Missing S3 Object
If database records exist but S3 objects are missing:
1. **Identify affected artifacts:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
   ```

2. **Check the S3 bucket:**

   - Verify the S3 bucket exists and is accessible
   - Check S3 access logs for deletion events
   - Check whether objects were moved or deleted by a lifecycle rule

3. **Recovery options** (a versioning-based sketch follows this list):

   - Restore from S3 versioning (if enabled)
   - Restore from backup
   - Re-upload the original content (if available)
   - Delete the orphaned database records
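If versioning is enabled on the bucket, a deleted object can often be recovered by removing its delete marker. A boto3 sketch (the bucket name is an assumption; the key uses the example hash from above):

```python
import boto3  # assumes direct S3 access with admin credentials

s3 = boto3.client("s3")
bucket = "orchard-artifacts"  # assumption: your Orchard bucket
key = "fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"

# Find the delete marker hiding the object
versions = s3.list_object_versions(Bucket=bucket, Prefix=key)
for marker in versions.get("DeleteMarkers", []):
    if marker["Key"] == key and marker["IsLatest"]:
        # Deleting the delete marker restores the previous version
        s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
```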
### Orphaned S3 Objects
If S3 objects exist without database records:
1. **Identify orphaned objects:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
   ```

2. **Investigate the cause:**

   - Was the upload interrupted before the database commit?
   - Was the database record deleted while the S3 cleanup failed?

3. **Resolution** (a cleanup sketch follows this list):

   - If the content is needed, create the database record manually
   - If the content is not needed, delete the S3 object to reclaim storage
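For the cleanup path, the orphaned keys reported by the consistency check can be deleted directly from S3. A hedged sketch (the bucket name is an assumption; try it against a test bucket first):

```python
import boto3
import requests

# Fetch the report and delete every orphaned object it lists
report = requests.get(f"{base_url}/api/v1/admin/consistency-check").json()

s3 = boto3.client("s3")
for key in report["orphaned_s3_keys"]:
    print(f"Deleting orphaned object: {key}")
    s3.delete_object(Bucket="orchard-artifacts", Key=key)  # assumption: your Orchard bucket
```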
### Preventive Measures

- **Enable S3 versioning** to recover from accidental deletions
- **Regular backups** of both the database and the S3 bucket
- **Scheduled consistency checks** to detect issues early
- **Monitoring and alerting** on consistency check failures
- **Audit logging** to track all artifact operations
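As one example, S3 versioning is a one-time bucket setting; a boto3 sketch (the bucket name is an assumption):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="orchard-artifacts",  # assumption: your Orchard bucket
    VersioningConfiguration={"Status": "Enabled"},
)
```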
## Verification in CI/CD

### Verifying Artifacts in Pipelines
```bash
#!/bin/bash
# Download and verify an artifact in a CI pipeline
ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"

# Download, capturing the response headers for verification
response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')

# Compute the actual hash
actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)

# Verify
if [ "$actual_hash" != "$expected_hash" ]; then
    echo "ERROR: Integrity check failed!"
    echo "Expected: $expected_hash"
    echo "Actual:   $actual_hash"
    exit 1
fi
echo "Integrity verified: $actual_hash"
```
### Using Server-Side Verification
For critical deployments, use server-side pre-verification:
```bash
# The server verifies before streaming - returns 500 if corrupt
curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz
```
This ensures the artifact is verified before any bytes are streamed to your pipeline.