# Integrity Verification

Orchard uses content-addressable storage with SHA256 hashing to ensure artifact integrity. This document describes how integrity verification works and how to use it.

## How It Works

### Content-Addressable Storage

Orchard stores artifacts using their SHA256 hash as the unique identifier. This provides several benefits:

1. **Automatic deduplication**: Identical content is stored only once
2. **Built-in integrity**: The artifact ID *is* the content hash
3. **Tamper detection**: Any modification changes the hash, making corruption detectable

When you upload a file:

1. Orchard computes the SHA256 hash of the content
2. The hash becomes the artifact ID (64-character hex string)
3. The file is stored in S3 at `fruits/{hash[0:2]}/{hash[2:4]}/{hash}`
4. The hash and metadata are recorded in the database

### Hash Format

- Algorithm: SHA256
- Format: 64-character lowercase hexadecimal string
- Example: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`

## Client-Side Verification

### Before Upload

Compute the hash locally before uploading so you can verify that the server received your content correctly:

```python
import hashlib

import requests

def compute_sha256(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Compute hash before upload
with open("myfile.tar.gz", "rb") as f:
    content = f.read()
local_hash = compute_sha256(content)

# Upload the file
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
)
result = response.json()

# Verify the server computed the same hash
assert result["artifact_id"] == local_hash, "Hash mismatch!"
```

### Providing Expected Hash on Upload

You can provide the expected hash in the upload request. The server will reject the upload if the computed hash doesn't match:

```python
response = requests.post(
    f"{base_url}/api/v1/project/{project}/{package}/upload",
    files={"file": ("myfile.tar.gz", content)},
    headers={"X-Checksum-SHA256": local_hash},
)

# Returns 422 if the hash doesn't match
if response.status_code == 422:
    print("Checksum mismatch - upload rejected")
```

### After Download

Verify downloaded content matches the expected hash using response headers:

```python
response = requests.get(
    f"{base_url}/api/v1/project/{project}/{package}/+/{tag}",
    params={"mode": "proxy"},
)

# Get expected hash from header
expected_hash = response.headers.get("X-Checksum-SHA256")

# Compute hash of downloaded content
actual_hash = compute_sha256(response.content)

# Verify
if actual_hash != expected_hash:
    raise Exception(f"Integrity check failed! Expected {expected_hash}, got {actual_hash}")
```
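
For large artifacts, loading the whole response into memory just to hash it can be wasteful. Below is a minimal sketch of an incremental alternative, assuming the same proxy-mode endpoint and `X-Checksum-SHA256` header shown above; the 1 MiB chunk size is arbitrary:

```python
import hashlib

import requests

url = f"{base_url}/api/v1/project/{project}/{package}/+/{tag}"
hasher = hashlib.sha256()

# Stream the download, hashing each chunk as it arrives
with requests.get(url, params={"mode": "proxy"}, stream=True) as response:
    response.raise_for_status()
    expected_hash = response.headers.get("X-Checksum-SHA256")
    with open("myfile.tar.gz", "wb") as out:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            hasher.update(chunk)
            out.write(chunk)

if hasher.hexdigest() != expected_hash:
    raise Exception(f"Integrity check failed! Expected {expected_hash}, got {hasher.hexdigest()}")
```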

### Response Headers for Verification

Download responses include multiple headers for verification:

| Header | Format | Description |
|--------|--------|-------------|
| `X-Checksum-SHA256` | Hex string | SHA256 hash (64 chars) |
| `ETag` | `"<sha256>"` | SHA256 hash in quotes |
| `Digest` | `sha-256=<base64>` | RFC 3230 format (base64-encoded) |
| `Content-Length` | Integer | File size in bytes |

### Server-Side Verification on Download

Request server-side verification during download:

```bash
# Pre-verification: Server verifies before streaming (returns 500 if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=pre"

# Stream verification: Server verifies while streaming (logs error if corrupt)
curl "${base_url}/api/v1/project/${project}/${package}/+/${tag}?mode=proxy&verify=true&verify_mode=stream"
```

The `X-Verified` header indicates whether server-side verification was performed:

- `X-Verified: true` - Content was verified by the server

## Server-Side Consistency Check

### Consistency Check Endpoint

Administrators can run a consistency check to verify all stored artifacts:

```bash
curl "${base_url}/api/v1/admin/consistency-check"
```

Response:

```json
{
  "total_artifacts_checked": 1234,
  "healthy": true,
  "orphaned_s3_objects": 0,
  "missing_s3_objects": 0,
  "size_mismatches": 0,
  "orphaned_s3_keys": [],
  "missing_s3_keys": [],
  "size_mismatch_artifacts": []
}
```

### What the Check Verifies

1. **Missing S3 objects**: Database records with no corresponding S3 object
2. **Orphaned S3 objects**: S3 objects with no database record
3. **Size mismatches**: S3 object size doesn't match the database record

### Running Consistency Checks

**Manual check:**

```bash
# Check all artifacts
curl "${base_url}/api/v1/admin/consistency-check"

# Limit results (for large deployments)
curl "${base_url}/api/v1/admin/consistency-check?limit=100"
```

**Scheduled checks (recommended):**

Set up a cron job or Kubernetes CronJob to run periodic checks:

```yaml
# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: orchard-consistency-check
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: check
              # NOTE: the stock curlimages/curl image does not ship jq;
              # use an image that provides both curl and jq
              image: curlimages/curl
              command:
                - /bin/sh
                - -c
                - |
                  response=$(curl -s "${ORCHARD_URL}/api/v1/admin/consistency-check")
                  healthy=$(echo "$response" | jq -r '.healthy')
                  if [ "$healthy" != "true" ]; then
                    echo "ALERT: Consistency check failed!"
                    echo "$response"
                    exit 1
                  fi
                  echo "Consistency check passed"
          restartPolicy: OnFailure
```

## Recovery Procedures

### Corrupted Artifact (Size Mismatch)

If the consistency check reports size mismatches:

1. **Identify affected artifacts:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.size_mismatch_artifacts'
   ```

2. **Check whether the artifact can be re-uploaded:**
   - If the original content is available, delete the corrupted artifact and re-upload
   - The same content will produce the same artifact ID

3. **If the original content is lost:**
   - The artifact data is corrupted and cannot be recovered
   - Delete the artifact record and notify affected users
   - Consider restoring from backup if available

### Missing S3 Object

If database records exist but S3 objects are missing:

1. **Identify affected artifacts:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.missing_s3_keys'
   ```

2. **Check the S3 bucket:**
   - Verify the S3 bucket exists and is accessible
   - Check S3 access logs for deletion events
   - Check whether objects were moved or lifecycle-deleted

3. **Recovery options:**
   - Restore from S3 versioning (if enabled)
   - Restore from backup
   - Re-upload the original content (if available)
   - Delete the orphaned database records

### Orphaned S3 Objects

If S3 objects exist without database records:

1. **Identify orphaned objects:**

   ```bash
   curl "${base_url}/api/v1/admin/consistency-check" | jq '.orphaned_s3_keys'
   ```

2. **Investigate the cause:**
   - Was the upload interrupted before the database commit?
   - Was the database record deleted while S3 cleanup failed?

3. **Resolution:**
   - If the content is needed, create the database record manually
   - If the content is not needed, delete the S3 object to reclaim storage (see the sketch below)
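
If you script the cleanup, drive it from the `orphaned_s3_keys` list in the consistency-check response. A hedged sketch using boto3 follows; `ORCHARD_URL` and the `orchard-artifacts` bucket name are stand-ins for your deployment's values, and you should review each key before deleting, since deletion is irreversible unless S3 versioning is enabled:

```python
import boto3
import requests

ORCHARD_URL = "https://orchard.example.com"  # stand-in for your deployment URL
BUCKET = "orchard-artifacts"                 # stand-in for your S3 bucket

# Pull the orphaned keys from the consistency-check report
report = requests.get(f"{ORCHARD_URL}/api/v1/admin/consistency-check").json()

s3 = boto3.client("s3")
for key in report["orphaned_s3_keys"]:
    # Irreversible unless S3 versioning is enabled -- review each key first
    print(f"Deleting orphaned object: {key}")
    s3.delete_object(Bucket=BUCKET, Key=key)
```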

### Preventive Measures

1. **Enable S3 versioning** to recover from accidental deletions
2. **Take regular backups** of both the database and the S3 bucket
3. **Schedule consistency checks** to detect issues early
4. **Monitor and alert** on consistency check failures
5. **Enable audit logging** to track all artifact operations

## Verification in CI/CD

### Verifying Artifacts in Pipelines

```bash
#!/bin/bash
# Download and verify an artifact in a CI pipeline
ARTIFACT_URL="${ORCHARD_URL}/api/v1/project/${PROJECT}/${PACKAGE}/+/${TAG}"

# Download the artifact, capturing the response headers
response=$(curl -s -D - "${ARTIFACT_URL}?mode=proxy" -o artifact.tar.gz)
expected_hash=$(echo "$response" | grep -i "X-Checksum-SHA256" | cut -d: -f2 | tr -d ' \r')

# Compute the actual hash
actual_hash=$(sha256sum artifact.tar.gz | cut -d' ' -f1)

# Verify
if [ "$actual_hash" != "$expected_hash" ]; then
    echo "ERROR: Integrity check failed!"
    echo "Expected: $expected_hash"
    echo "Actual:   $actual_hash"
    exit 1
fi

echo "Integrity verified: $actual_hash"
```

### Using Server-Side Verification

For critical deployments, use server-side pre-verification:

```bash
# Server verifies before streaming - returns 500 if corrupt
curl -f "${ARTIFACT_URL}?mode=proxy&verify=true&verify_mode=pre" -o artifact.tar.gz
```

This ensures the artifact is verified before any bytes are streamed to your pipeline.
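
The same defense-in-depth flow can be expressed in Python, combining server-side pre-verification with a client-side re-check. This is a minimal sketch assuming the endpoint, query parameters, and headers documented above:

```python
import hashlib

import requests

url = f"{base_url}/api/v1/project/{project}/{package}/+/{tag}"

# Ask the server to verify the artifact before streaming any bytes
response = requests.get(
    url,
    params={"mode": "proxy", "verify": "true", "verify_mode": "pre"},
)
response.raise_for_status()  # a corrupt artifact surfaces as HTTP 500

# Confirm the server actually performed the verification
assert response.headers.get("X-Verified") == "true"

# Belt and braces: re-check the hash client-side
expected_hash = response.headers.get("X-Checksum-SHA256")
actual_hash = hashlib.sha256(response.content).hexdigest()
assert actual_hash == expected_hash, f"Expected {expected_hash}, got {actual_hash}"

with open("artifact.tar.gz", "wb") as out:
    out.write(response.content)
```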