Add integrity verification documentation
Document how content-addressable storage and integrity verification works:

- SHA256 hashing and content-addressable storage overview
- Client-side verification steps (before upload, after download)
- Server-side consistency check endpoint and scheduling
- Recovery procedures for corrupted, missing, or orphaned artifacts
- CI/CD integration examples
@@ -23,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added consistency check endpoint tests with response format validation
- Added corruption detection tests: bit flip, truncation, appended content, size mismatch, missing S3 objects
- Added Digest header tests (RFC 3230) and verification mode tests
- Added integrity verification documentation (`docs/integrity-verification.md`)
- Added `package_versions` table for immutable version tracking separate from mutable tags (#56)
  - Versions are set at upload time via explicit `version` parameter or auto-detected from filename/metadata
  - Version detection priority: explicit parameter > package metadata > filename pattern

294 docs/integrity-verification.md Normal file

@@ -0,0 +1,294 @@

# Integrity Verification

Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.

## How It Works

### Content-Addressable Storage

Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:

1. **Automatic deduplication**: Identical content is stored only once
2. **Built-in integrity**: The artifact ID *is* the content hash
3. **Tamper detection**: Any modification changes the hash, making corruption detectable

When you upload a file:

1. Orchard computes the SHA256 hash of the content
2. The hash becomes the artifact ID (a 64-character hex string)
3. The file is stored in S3 at `fruits/{hash[0:2]}/{hash[2:4]}/{hash}`
4. The hash and metadata are recorded in the database
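
As an illustration of steps 1-3, a minimal sketch of how the storage key falls out of the content hash (`s3_key_for` is a hypothetical helper for illustration, not part of Orchard's API):

```python
import hashlib

def s3_key_for(content: bytes) -> str:
    # The artifact ID is the SHA256 hex digest of the content
    digest = hashlib.sha256(content).hexdigest()
    # Objects are sharded into prefixes by the first two hex pairs
    return f"fruits/{digest[0:2]}/{digest[2:4]}/{digest}"

print(s3_key_for(b"hello"))
# fruits/2c/f2/2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```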

### Hash Format

- Algorithm: SHA256
- Format: 64-character lowercase hexadecimal string
- Example: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`

## Client-Side Verification

### Before Upload

Compute the hash locally before uploading to verify the server received your content correctly:

```python
import hashlib

import requests

def compute_sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Compute the hash before upload
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Upload the file
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
)
result = response.json()

# Verify the server computed the same hash
assert result["artifact_id"] == local_hash, "Hash mismatch!"
```

### Providing Expected Hash on Upload

You can provide the expected hash in the upload request; the server rejects the upload if the computed hash doesn't match:

```python
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)

# Returns 422 if the hash doesn't match
if response.status_code == 422:
    print("Checksum mismatch - upload rejected")
```

### After Download

Verify downloaded content matches the expected hash using the response headers:

```python
response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy"},
)

# Get the expected hash from the header
expected_hash = response.headers.get("X-Checksum-SHA256")

# Compute the hash of the downloaded content
actual_hash = compute_sha256(response.content)

# Verify
if actual_hash != expected_hash:
    raise Exception(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")
```

### Response Headers for Verification

Download responses include several headers for verification:

| Header | Format | Description |
|--------|--------|-------------|
| `X-Checksum-SHA256` | Hex string | SHA256 hash (64 chars) |
| `ETag` | `"<hash>"` | SHA256 hash in quotes |
| `Digest` | `sha-256=<base64>` | RFC 3230 format (base64-encoded) |
| `Content-Length` | Integer | File size in bytes |
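
Note that the `Digest` header base64-encodes the raw digest bytes rather than the hex string, so it cannot be compared to `X-Checksum-SHA256` directly. A minimal conversion sketch:

```python
import base64

# Convert the hex digest (as in X-Checksum-SHA256) into RFC 3230 Digest form
hex_digest = "dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
digest_value = "sha-256=" + base64.b64encode(bytes.fromhex(hex_digest)).decode()
# digest_value can now be compared against response.headers.get("Digest")
```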

### Server-Side Verification on Download

Request server-side verification during download:

```bash
# Pre-verification: server verifies before streaming (returns 500 if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"

# Stream verification: server verifies while streaming (logs an error if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"
```

The `X-Verified` header indicates whether server-side verification was performed:

- `X-Verified: true` - Content was verified by the server
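
A minimal client-side check of that header, reusing `requests` and the placeholder variables from the earlier examples:

```python
response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy", "verify": "true", "verify_mode": "pre"},
)
# Fail if the server did not actually perform verification
if response.headers.get("X-Verified") != "true":
    raise Exception("Server-side verification was not performed")
```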

## Server-Side Consistency Check

### Consistency Check Endpoint

Administrators can run a consistency check to verify all stored artifacts:

```bash
curl "${base_url}/api/v1/admin/consistency-check"
```

Response:

```json
{
  "total_artifacts_checked": 1234,
  "healthy": true,
  "orphaned_s3_objects": 0,
  "missing_s3_objects": 0,
  "size_mismatches": 0,
  "orphaned_s3_keys": [],
  "missing_s3_keys": [],
  "size_mismatch_artifacts": []
}
```

### What the Check Verifies

The check performs three comparisons (sketched below):

1. **Missing S3 objects**: Database records with no corresponding S3 object
2. **Orphaned S3 objects**: S3 objects with no database record
3. **Size mismatches**: S3 object size doesn't match the database record
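
A rough sketch of those comparisons, assuming hypothetical inputs `db_artifacts` (artifact ID to size, from the database) and `s3_objects` (key to size, from a bucket listing); the real check runs server-side:

```python
def check_consistency(db_artifacts: dict[str, int], s3_objects: dict[str, int]) -> dict:
    def key_for(artifact_id: str) -> str:
        return f"fruits/{artifact_id[0:2]}/{artifact_id[2:4]}/{artifact_id}"

    expected = {key_for(a): size for a, size in db_artifacts.items()}
    missing = [k for k in expected if k not in s3_objects]    # DB record, no S3 object
    orphaned = [k for k in s3_objects if k not in expected]   # S3 object, no DB record
    mismatched = [k for k in expected                         # sizes disagree
                  if k in s3_objects and s3_objects[k] != expected[k]]
    return {
        "missing_s3_keys": missing,
        "orphaned_s3_keys": orphaned,
        "size_mismatch_artifacts": mismatched,
        "healthy": not (missing or orphaned or mismatched),
    }
```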

### Running Consistency Checks

**Manual check:**

```bash
# Check all artifacts
curl "${base_url}/api/v1/admin/consistency-check"

# Limit results (for large deployments)
curl "${base_url}/api/v1/admin/consistency-check?limit=100"
```

**Scheduled checks (recommended):**

Set up a cron job or Kubernetes CronJob to run periodic checks:

```yaml
# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orchard-consistency-check
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: check
              image: curlimages/curl
              command:
                - /bin/sh
                - -c
                - |
                  response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
                  # curlimages/curl does not ship jq, so test the JSON with grep
                  if ! echo "$response" | grep -q '"healthy": *true'; then
                    echo "ALERT: Consistency check failed!"
                    echo "$response"
                    exit 1
                  fi
                  echo "Consistency check passed"
          restartPolicy: OnFailure
```
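
An equivalent plain-cron sketch, assuming `curl` and `jq` are installed on the host and `ORCHARD_URL` is defined in the crontab:

```bash
# m h dom mon dow  command
0 2 * * * curl -s "$ORCHARD_URL/api/v1/admin/consistency-check" | jq -e '.healthy' >/dev/null || logger -t orchard "consistency check failed"
```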

## Recovery Procedures

### Corrupted Artifact (Size Mismatch)

If the consistency check reports size mismatches:

1. **Identify affected artifacts:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
   ```

2. **Check if the artifact can be re-uploaded:**
   - If the original content is available, delete the corrupted artifact and re-upload it (see the sketch after this list)
   - The same content will produce the same artifact ID

3. **If the original content is lost:**
   - The artifact data is corrupted and cannot be recovered
   - Delete the artifact record and notify affected users
   - Consider restoring from backup if available
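
A minimal re-upload sketch, assuming the original file is still available locally; re-uploading the same bytes reproduces the same artifact ID:

```bash
local_hash=$(sha256sum myfile.tar.gz | cut -d' ' -f1)

# Ask the server to reject the upload unless the hashes agree
curl -f -X POST \
  -H "X-Checksum-SHA256: ${local_hash}" \
  -F "file=@myfile.tar.gz" \
  "${base_url}/api/v1/project/${project}/${package}/upload"
```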

### Missing S3 Object

If database records exist but S3 objects are missing:

1. **Identify affected artifacts:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
   ```

2. **Check the S3 bucket:**
   - Verify the S3 bucket exists and is accessible
   - Check S3 access logs for deletion events
   - Check if objects were moved or lifecycle-deleted

3. **Recovery options** (see the sketch after this list):
   - Restore from S3 versioning (if enabled)
   - Restore from backup
   - Re-upload original content (if available)
   - Delete orphaned database records
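
A sketch of the versioning path using the AWS CLI, assuming versioning is enabled on the bucket and `$BUCKET`/`$KEY` come from the consistency-check output:

```bash
# List surviving versions (and any delete markers) of the missing key
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY"

# Restore by copying a prior version back over the key
aws s3api copy-object --bucket "$BUCKET" --key "$KEY" \
  --copy-source "$BUCKET/$KEY?versionId=<version-id>"
```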

### Orphaned S3 Objects

If S3 objects exist without database records:

1. **Identify orphaned objects:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
   ```

2. **Investigate the cause:**
   - Upload interrupted before database commit?
   - Database record deleted but S3 cleanup failed?

3. **Resolution:**
   - If the content is needed, create the database record manually
   - If the content is not needed, delete the S3 object to reclaim storage (see the sketch after this list)
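
A one-line AWS CLI sketch for the reclaim case, assuming `$BUCKET` and a `$KEY` taken from `orphaned_s3_keys`:

```bash
aws s3 rm "s3://${BUCKET}/${KEY}"
```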

### Preventive Measures

1. **Enable S3 versioning** to recover from accidental deletions (see the sketch after this list)
2. **Regular backups** of both the database and the S3 bucket
3. **Scheduled consistency checks** to detect issues early
4. **Monitoring and alerting** on consistency check failures
5. **Audit logging** to track all artifact operations
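
Enabling versioning is a one-time AWS CLI call, assuming `$BUCKET` is the artifact bucket:

```bash
aws s3api put-bucket-versioning --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled
```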

## Verification in CI/CD

### Verifying Artifacts in Pipelines

```bash
#!/bin/bash
# Download and verify an artifact in a CI pipeline

ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"

# Download the body to a file; -D - writes the response headers to stdout
response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')

# Compute the actual hash
actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)

# Verify
if [ "$actual_hash" != "$expected_hash" ]; then
    echo "ERROR: Integrity check failed!"
    echo "Expected: $expected_hash"
    echo "Actual: $actual_hash"
    exit 1
fi

echo "Integrity verified: $actual_hash"
```

### Using Server-Side Verification

For critical deployments, use server-side pre-verification:

```bash
# Server verifies before streaming - returns 500 if corrupt, which -f turns into a curl failure
curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz
```

This ensures the artifact is verified before any bytes are streamed to your pipeline.