Add ref_count management for deletions with atomic operations and error handling

Mondo Diaz
2026-01-06 13:44:23 -06:00
parent 66622caf5d
commit 7e68baed08
24 changed files with 6888 additions and 329 deletions


@@ -0,0 +1,575 @@
# Deduplication Design Document
This document defines Orchard's content-addressable storage and deduplication approach using SHA256 hashes.
## Table of Contents
1. [Overview](#overview)
2. [Hash Algorithm Selection](#hash-algorithm-selection)
3. [Content-Addressable Storage Model](#content-addressable-storage-model)
4. [S3 Key Derivation](#s3-key-derivation)
5. [Duplicate Detection Strategy](#duplicate-detection-strategy)
6. [Reference Counting Lifecycle](#reference-counting-lifecycle)
7. [Edge Cases and Error Handling](#edge-cases-and-error-handling)
8. [Collision Handling](#collision-handling)
9. [Performance Considerations](#performance-considerations)
10. [Operations Runbook](#operations-runbook)
---
## Overview
Orchard uses **whole-file deduplication** based on content hashing. When a file is uploaded:
1. The SHA256 hash of the entire file content is computed
2. The hash becomes the artifact's primary identifier
3. If a file with the same hash already exists, no duplicate is stored
4. Multiple tags/references can point to the same artifact
**Scope:** Orchard implements whole-file deduplication only. Chunk-level or block-level deduplication is out of scope for MVP.
---
## Hash Algorithm Selection
### Decision: SHA256
| Criteria | SHA256 | SHA1 | MD5 | Blake3 |
|----------|--------|------|-----|--------|
| Security | Strong (256-bit) | Weak (broken) | Weak (broken) | Strong |
| Speed | ~400 MB/s | ~600 MB/s | ~800 MB/s | ~1500 MB/s |
| Collision Resistance | 2^128 | Broken | Broken | 2^128 |
| Industry Adoption | Universal | Legacy | Legacy | Emerging |
| Tool Ecosystem | Excellent | Good | Good | Growing |
### Rationale
1. **Security**: SHA256 has no known practical collision attacks. SHA1 and MD5 are cryptographically broken.
2. **Collision Resistance**: With a 256-bit output, accidental collisions are negligible: reaching even a 50% chance of any collision would require roughly 2^128 (~3 × 10^38) unique files.
3. **Industry Standard**: SHA256 is the de facto standard for content-addressable storage (Git, Docker, npm, etc.).
4. **Performance**: While Blake3 is faster, SHA256 throughput (~400 MB/s) exceeds typical network bandwidth for uploads. The bottleneck is I/O, not hashing.
5. **Tooling**: Universal support in all languages, operating systems, and verification tools.
### Migration Path
If a future algorithm change is needed (e.g., SHA3 or Blake3):
1. **Database**: Add `hash_algorithm` column to artifacts table (default: 'sha256')
2. **S3 Keys**: New algorithm uses different prefix (e.g., `fruits-sha3/` vs `fruits/`)
3. **API**: Accept algorithm hint in upload, return algorithm in responses
4. **Migration**: Background job to re-hash existing artifacts if needed
**Current Implementation**: Single algorithm (SHA256), no algorithm versioning required for MVP.
---
## Content-Addressable Storage Model
### Core Principles
1. **Identity = Content**: The artifact ID IS the SHA256 hash of its content
2. **Immutability**: Content cannot change after storage (same hash = same content)
3. **Deduplication**: Same content uploaded twice results in single storage
4. **Metadata Independence**: Files with identical content but different names/types are deduplicated
### Data Model
```
Artifact {
id: VARCHAR(64) PRIMARY KEY -- SHA256 hash (lowercase hex)
size: BIGINT -- File size in bytes
ref_count: INTEGER -- Number of references
s3_key: VARCHAR(1024) -- S3 storage path
checksum_md5: VARCHAR(32) -- Secondary checksum
checksum_sha1: VARCHAR(40) -- Secondary checksum
...
}
Tag {
id: UUID PRIMARY KEY
name: VARCHAR(255)
package_id: UUID FK
artifact_id: VARCHAR(64) FK -- Points to Artifact.id (SHA256)
}
```
### Hash Format
- Algorithm: SHA256
- Output: 64 lowercase hexadecimal characters
- Example: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`
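For illustration, the format can be produced and checked with Python's standard library; the regex below simply encodes the "64 lowercase hex characters" rule (names here are illustrative, not from the codebase):
```python
import hashlib
import re

# Matches exactly 64 lowercase hexadecimal characters (the artifact ID format).
ARTIFACT_ID_RE = re.compile(r"^[0-9a-f]{64}$")

def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA256 digest of in-memory bytes."""
    return hashlib.sha256(data).hexdigest()

assert ARTIFACT_ID_RE.match(sha256_hex(b"example content"))
```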
---
## S3 Key Derivation
### Key Structure
```
fruits/{hash[0:2]}/{hash[2:4]}/{full_hash}
```
Example for hash `dffd6021bb2bd5b0...`:
```
fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
```
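A minimal sketch of the derivation, assuming only the rule above (the function name is illustrative, not the actual helper):
```python
def derive_s3_key(sha256_hex: str, prefix: str = "fruits") -> str:
    """Derive the sharded S3 key: fruits/{hash[0:2]}/{hash[2:4]}/{hash}."""
    if len(sha256_hex) != 64:
        raise ValueError("expected a 64-character SHA256 hex digest")
    return f"{prefix}/{sha256_hex[0:2]}/{sha256_hex[2:4]}/{sha256_hex}"

# derive_s3_key("dffd6021bb2bd5b0...") -> "fruits/df/fd/dffd6021bb2bd5b0..."
```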
### Rationale for Prefix Sharding
1. **S3 Performance**: S3 partitions by key prefix. Distributing across prefixes improves throughput.
2. **Filesystem Compatibility**: When using filesystem-backed storage, avoids single directory with millions of files.
3. **Distribution**: Two 2-character hex levels (256 combinations each) yield 65,536 (256 × 256) leaf prefixes.
### Bucket Distribution Analysis
Assuming uniformly distributed SHA256 hashes:
| Artifacts | Files per Prefix (avg) | Max per Prefix (99.9th percentile) |
|-----------|------------------------|------------------------|
| 100,000 | 1.5 | 10 |
| 1,000,000 | 15 | 50 |
| 10,000,000 | 152 | 250 |
| 100,000,000 | 1,525 | 2,000 |
The two-level prefix provides excellent distribution up to hundreds of millions of artifacts.
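The averages in the table are just the artifact count divided by 65,536; a quick sanity check under the uniform-hash assumption:
```python
PREFIX_COUNT = 256 * 256  # two hex-character levels -> 65,536 leaf prefixes

for artifacts in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{artifacts:>11,} artifacts -> ~{artifacts / PREFIX_COUNT:,.1f} files per prefix")
# ~1.5, ~15.3, ~152.6, ~1,525.9 on average
```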
---
## Duplicate Detection Strategy
### Upload Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ UPLOAD REQUEST │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 1. VALIDATE: Check file size limits (min/max) │
│ - Empty files (0 bytes) → Reject with 422 │
│ - Exceeds max_file_size → Reject with 413 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 2. COMPUTE HASH: Stream file through SHA256/MD5/SHA1 │
│ - Use 8MB chunks for memory efficiency │
│ - Single pass for all three hashes │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 3. DERIVE S3 KEY: fruits/{hash[0:2]}/{hash[2:4]}/{hash} │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 4. CHECK EXISTENCE: HEAD request to S3 for derived key │
│ - Retry up to 3 times on transient failures │
└─────────────────────────────────────────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────────────┐
│ EXISTS: Deduplicated │ │ NOT EXISTS: Upload to S3 │
│ - Verify size matches │ │ - PUT object (or multipart) │
│ - Skip S3 upload │ │ - Abort on failure │
│ - Log saved bytes │ └─────────────────────────────────┘
└─────────────────────────┘ │
│ │
└───────────────┬───────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 5. DATABASE: Create/update artifact record │
│ - Use row locking to prevent race conditions │
│ - ref_count managed by SQL triggers │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ 6. CREATE TAG: If tag provided, create/update tag │
│ - SQL trigger increments ref_count │
└─────────────────────────────────────────────────────────────────┘
```
### Hash Computation
**Memory Requirements:**
- Chunk size: 8MB (`HASH_CHUNK_SIZE`)
- Working memory: ~25MB (8MB chunk + hash states)
- Independent of file size (streaming)
**Throughput:**
- SHA256 alone: ~400 MB/s on modern CPU
- With MD5 + SHA1: ~300 MB/s (parallel computation)
- Typical bottleneck: Network I/O, not CPU
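A minimal sketch of the single-pass, chunked computation described above (names such as `compute_checksums` are illustrative):
```python
import hashlib

HASH_CHUNK_SIZE = 8 * 1024 * 1024  # 8MB chunks keep memory use independent of file size

def compute_checksums(fileobj):
    """Stream a file once, updating SHA256, MD5, and SHA1 together."""
    sha256, md5, sha1 = hashlib.sha256(), hashlib.md5(), hashlib.sha1()
    size = 0
    while True:
        chunk = fileobj.read(HASH_CHUNK_SIZE)
        if not chunk:
            break
        size += len(chunk)
        sha256.update(chunk)
        md5.update(chunk)
        sha1.update(chunk)
    return sha256.hexdigest(), md5.hexdigest(), sha1.hexdigest(), size
```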
### Multipart Upload Threshold
Files larger than 100MB use S3 multipart upload:
- First pass: Stream to compute hashes
- If not duplicate: Seek to start, upload in 10MB parts
- On failure: Abort multipart upload (no orphaned parts)
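A hedged sketch of the abort-on-failure pattern using boto3's multipart API (simplified; the service's actual upload path may differ):
```python
import boto3

PART_SIZE = 10 * 1024 * 1024  # 10MB parts, per the threshold above

def multipart_upload(fileobj, bucket: str, key: str) -> None:
    """Upload in parts; abort on any failure so no orphaned parts remain."""
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    try:
        fileobj.seek(0)  # second pass: hashes were computed on the first pass
        parts, part_number = [], 1
        while True:
            chunk = fileobj.read(PART_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                                  UploadId=upload_id, Body=chunk)
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
        s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                     MultipartUpload={"Parts": parts})
    except Exception:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```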
---
## Reference Counting Lifecycle
### What Constitutes a "Reference"
A reference is a **Tag** pointing to an artifact. Each tag increments the ref_count by 1.
**Uploads do NOT directly increment ref_count** - only tag creation does.
### Lifecycle
```
┌─────────────────────────────────────────────────────────────────┐
│ CREATE: New artifact uploaded │
│ - ref_count = 0 (no tags yet) │
│ - Artifact exists but is "orphaned" │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TAG CREATED: Tag points to artifact │
│ - SQL trigger: ref_count += 1 │
│ - Artifact is now referenced │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TAG UPDATED: Tag moved to different artifact │
│ - SQL trigger on old artifact: ref_count -= 1 │
│ - SQL trigger on new artifact: ref_count += 1 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TAG DELETED: Tag removed │
│ - SQL trigger: ref_count -= 1 │
│ - If ref_count = 0, artifact is orphaned │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ GARBAGE COLLECTION: Clean up orphaned artifacts │
│ - Triggered manually via admin endpoint │
│ - Finds artifacts where ref_count = 0 │
│ - Deletes from S3 and database │
└─────────────────────────────────────────────────────────────────┘
```
### SQL Triggers
Three triggers manage ref_count automatically:
1. **`tags_ref_count_insert_trigger`**: On tag INSERT, increment target artifact's ref_count
2. **`tags_ref_count_delete_trigger`**: On tag DELETE, decrement target artifact's ref_count
3. **`tags_ref_count_update_trigger`**: On tag UPDATE (artifact_id changed), decrement old, increment new
### Garbage Collection
**Trigger**: Manual admin endpoint (`POST /api/v1/admin/garbage-collect`)
**Process**:
1. Query artifacts where `ref_count = 0`
2. For each orphan:
- Delete from S3 (`DELETE fruits/xx/yy/hash`)
- Delete from database
- Log deletion
**Safety**:
- Dry-run mode by default (`?dry_run=true`)
- Limit per run (`?limit=100`)
- Check constraint prevents ref_count < 0
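A simplified sketch of the steps above, assuming direct access to the `artifacts` table from the data model and a boto3 client (the admin endpoint's real implementation may differ):
```python
import boto3
import psycopg2  # direct-connection sketch; the service uses its own DB layer

def garbage_collect(dsn: str, bucket: str, dry_run: bool = True, limit: int = 100) -> list[str]:
    """Delete up to `limit` artifacts with ref_count = 0 from S3 and the database."""
    s3 = boto3.client("s3")
    deleted = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, s3_key FROM artifacts WHERE ref_count = 0 LIMIT %s", (limit,)
        )
        for artifact_id, s3_key in cur.fetchall():
            if not dry_run:
                s3.delete_object(Bucket=bucket, Key=s3_key)
                cur.execute(
                    "DELETE FROM artifacts WHERE id = %s AND ref_count = 0",
                    (artifact_id,),
                )
            deleted.append(artifact_id)
    return deleted
```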
---
## Edge Cases and Error Handling
### Empty Files
- **Behavior**: Rejected with HTTP 422
- **Reason**: Empty content has deterministic hash but provides no value
- **Error**: "Empty files are not allowed"
### Maximum File Size
- **Default Limit**: 10GB (`ORCHARD_MAX_FILE_SIZE`)
- **Configurable**: Via environment variable
- **Behavior**: Rejected with HTTP 413 before upload begins
- **Error**: "File too large. Maximum size is 10GB"
### Concurrent Upload of Same Content
**Race Condition Scenario**: Two clients upload identical content simultaneously.
**Handling**:
1. **S3 Level**: Both compute same hash, both check existence, both may upload
2. **Database Level**: Row-level locking with `SELECT ... FOR UPDATE`
3. **Outcome**: One creates artifact, other sees it exists, both succeed
4. **Trigger Safety**: SQL triggers are atomic per row
**No Data Corruption**: Concurrent PUTs to the same key carry identical content, so whichever write lands last yields the same object; there is nothing to corrupt.
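A hedged sketch of the locking pattern in step 2, written as raw SQL against the data model above (the service's database layer may express this differently):
```python
def get_or_create_artifact(cur, sha256_hash: str, size: int, s3_key: str) -> None:
    """Serialize concurrent inserts of the same hash with a row lock."""
    # Lock the row if it already exists; concurrent transactions block here.
    cur.execute("SELECT id FROM artifacts WHERE id = %s FOR UPDATE", (sha256_hash,))
    if cur.fetchone() is None:
        # ON CONFLICT covers the race where another transaction inserts first.
        cur.execute(
            "INSERT INTO artifacts (id, size, ref_count, s3_key) "
            "VALUES (%s, %s, 0, %s) ON CONFLICT (id) DO NOTHING",
            (sha256_hash, size, s3_key),
        )
```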
### Upload Interrupted
**Scenario**: Upload fails after hash computed but before S3 write completes.
**Simple Upload**:
- S3 put_object is atomic - either completes or fails entirely
- No cleanup needed
**Multipart Upload**:
- On any failure, `abort_multipart_upload` is called
- S3 cleans up partial parts
- No orphaned data
### DB Exists but S3 Missing
**Detection**: Download request finds artifact in DB but S3 returns 404.
**Current Behavior**: Return 500 error to client.
**Recovery Options** (not yet implemented):
1. Mark artifact for re-upload (set flag, notify admins)
2. Decrement ref_count to trigger garbage collection
3. Return specific error code for client retry
**Recommended**: Log critical alert, return 503 with retry hint.
### S3 Exists but DB Missing
**Detection**: Orphan - file in S3 with no corresponding DB record.
**Cause**:
- Failed transaction after S3 upload
- Manual S3 manipulation
- Database restore from backup
**Recovery**:
- Garbage collection won't delete (no DB record to query)
- Requires S3 bucket scan + DB reconciliation
- Manual admin task (out of scope for MVP)
### Network Timeout During Existence Check
**Behavior**: Retry up to 3 times with adaptive backoff.
**After Retries Exhausted**: Raise `S3ExistenceCheckError`, return 503 to client.
**Rationale**: Don't upload without knowing if duplicate exists (prevents orphans).
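One way the check-and-retry behaviour could be wired with boto3 (a sketch assuming botocore's built-in retry configuration; only `S3ExistenceCheckError` is named in this document):
```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

class S3ExistenceCheckError(Exception):
    """Raised when the existence check fails after all retries (maps to HTTP 503)."""

# botocore retries transient failures; "adaptive" mode adds client-side rate limiting.
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 3, "mode": "adaptive"}))

def object_exists(bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise S3ExistenceCheckError(f"existence check failed for {key}") from exc
```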
---
## Collision Handling
### SHA256 Collision Probability
For random inputs, the probability of collision is:
```
P(collision) ≈ n² / 2^257
Where n = number of unique files
```
| Files | Collision Probability |
|-------|----------------------|
| 10^9 (1 billion) | 10^-59 |
| 10^12 (1 trillion) | 10^-53 |
| 10^18 | 10^-41 |
**Practical Assessment**: Meaningful collision risk only appears at around 2^128 (~3 × 10^38) stored files, many orders of magnitude beyond any realistic artifact count.
### Detection Mechanism
Despite near-zero probability, we detect potential collisions by:
1. **Size Comparison**: If hash matches but sizes differ, CRITICAL alert
2. **ETag Verification**: S3 ETag provides secondary check
### Handling Procedure
If collision detected (size mismatch):
1. **Log CRITICAL alert** with full details
2. **Reject upload** with 500 error
3. **Do NOT overwrite** existing content
4. **Notify operations** for manual investigation
```python
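# Raised when the derived S3 key already exists but the stored size differs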
raise HashCollisionError(
f"Hash collision detected for {sha256_hash}: size mismatch"
)
```
### MVP Position
For MVP, we:
- Detect collisions via size mismatch
- Log and alert on detection
- Reject conflicting upload
- Accept that true collisions are practically impossible
No active mitigation (e.g., storing hash + size as composite key) is needed.
---
## Performance Considerations
### Hash Computation Overhead
| File Size | Hash Time | Upload Time (100 Mbps) | Overhead |
|-----------|-----------|------------------------|----------|
| 10 MB | 25ms | 800ms | 3% |
| 100 MB | 250ms | 8s | 3% |
| 1 GB | 2.5s | 80s | 3% |
| 10 GB | 25s | 800s | 3% |
**Conclusion**: At the assumed 100 Mbps link, hash computation adds ~3% to total upload time regardless of file size; network I/O dominates.
### Existence Check Overhead
- S3 HEAD request: ~50-100ms per call
- Future optimization: cache existence results (Redis or in-memory) for hot paths
- Current MVP: No caching (acceptable for expected load)
### Deduplication Savings
Example with 50% duplication rate:
| Metric | Without Dedup | With Dedup | Savings |
|--------|---------------|------------|---------|
| Storage (100K files, 10MB avg) | 1 TB | 500 GB | 50% |
| Upload bandwidth | 1 TB | 500 GB | 50% |
| S3 costs | $23/mo | $11.50/mo | 50% |
---
## Operations Runbook
### Monitoring Deduplication
```bash
# View deduplication stats
curl http://orchard:8080/api/v1/stats/deduplication
# Response includes:
# - deduplication_ratio
# - total_uploads, deduplicated_uploads
# - bytes_saved
```
### Checking for Orphaned Artifacts
```bash
# List orphaned artifacts (ref_count = 0)
curl http://orchard:8080/api/v1/admin/orphaned-artifacts
# Dry-run garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=true"
# Execute garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=false"
```
### Verifying Artifact Integrity
```bash
# Download and verify hash matches artifact ID
ARTIFACT_ID="dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
curl -o downloaded_file http://orchard:8080/api/v1/artifact/$ARTIFACT_ID/download
COMPUTED=$(sha256sum downloaded_file | cut -d' ' -f1)
[ "$ARTIFACT_ID" = "$COMPUTED" ] && echo "OK" || echo "INTEGRITY FAILURE"
```
### Troubleshooting
| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| "Hash computation error" | Empty file or read error | Check file content, retry |
| "Storage unavailable" | S3/MinIO down | Check S3 health, retry |
| "File too large" | Exceeds max_file_size | Adjust config or use chunked upload |
| "Hash collision detected" | Extremely rare | Investigate, do not ignore |
| Orphaned artifacts accumulating | Tags deleted, no GC run | Run garbage collection |
| Download returns 404 | S3 object missing | Check S3 bucket, restore from backup |
### Configuration Reference
| Variable | Default | Description |
|----------|---------|-------------|
| `ORCHARD_MAX_FILE_SIZE` | 10GB | Maximum upload size |
| `ORCHARD_MIN_FILE_SIZE` | 1 byte | Minimum upload size (rejects empty files) |
| `ORCHARD_S3_MAX_RETRIES` | 3 | Retry attempts for S3 operations |
| `ORCHARD_S3_CONNECT_TIMEOUT` | 10s | S3 connection timeout |
| `ORCHARD_S3_READ_TIMEOUT` | 60s | S3 read timeout |
---
## Appendix: Decision Records
### ADR-001: SHA256 for Content Hashing
**Status**: Accepted
**Context**: Need deterministic content identifier for deduplication.
**Decision**: Use SHA256.
**Rationale**:
- Cryptographically strong (no known attacks)
- Universal adoption (Git, Docker, npm)
- Sufficient speed for I/O-bound workloads
- Excellent tooling
**Consequences**:
- 64-character artifact IDs (longer than UUIDs)
- CPU overhead ~3% of upload time
- Future algorithm migration requires versioning
### ADR-002: Whole-File Deduplication Only
**Status**: Accepted
**Context**: Could implement chunk-level deduplication for better savings.
**Decision**: Whole-file only for MVP.
**Rationale**:
- Simpler implementation
- No chunking algorithm complexity
- Sufficient for build artifact use case
- Can add chunk-level later if needed
**Consequences**:
- Files with partial overlap stored entirely
- Large files with small changes not deduplicated
- Acceptable for binary artifact workloads
### ADR-003: SQL Triggers for ref_count
**Status**: Accepted
**Context**: ref_count must be accurate for garbage collection.
**Decision**: Use PostgreSQL triggers, not application code.
**Rationale**:
- Atomic with tag operations
- Cannot be bypassed
- Works regardless of client (API, direct SQL, migrations)
- Simpler application code
**Consequences**:
- Trigger logic in SQL (less visible)
- Must maintain triggers across schema changes
- Debugging requires database access