Deduplication Design Document

This document defines Orchard's content-addressable storage and deduplication approach using SHA256 hashes.

Table of Contents

  1. Overview
  2. Hash Algorithm Selection
  3. Content-Addressable Storage Model
  4. S3 Key Derivation
  5. Duplicate Detection Strategy
  6. Reference Counting Lifecycle
  7. Edge Cases and Error Handling
  8. Collision Handling
  9. Performance Considerations
  10. Operations Runbook

Overview

Orchard uses whole-file deduplication based on content hashing. When a file is uploaded:

  1. The SHA256 hash of the entire file content is computed
  2. The hash becomes the artifact's primary identifier
  3. If a file with the same hash already exists, no duplicate is stored
  4. Multiple tags/references can point to the same artifact

Scope: Orchard implements whole-file deduplication only. Chunk-level or block-level deduplication is out of scope for MVP.
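
The end-to-end behavior can be summarized in a short sketch. This is illustrative only, assuming a boto3 S3 client and a placeholder bucket name (`orchard-artifacts`); the real handler also performs validation, secondary checksums, and the database updates described in later sections.

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "orchard-artifacts"  # placeholder bucket name

def upload_artifact(path: str) -> str:
    """Return the SHA256 artifact ID, uploading to S3 only if the content is new."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            sha256.update(chunk)
    artifact_id = sha256.hexdigest()
    key = f"fruits/{artifact_id[0:2]}/{artifact_id[2:4]}/{artifact_id}"

    try:
        s3.head_object(Bucket=BUCKET, Key=key)   # already stored: deduplicated
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
        s3.upload_file(path, BUCKET, key)        # new content: upload it
    # Database upsert of the artifact row (and optional tag) happens after this.
    return artifact_id
```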


Hash Algorithm Selection

Decision: SHA256

| Criteria | SHA256 | SHA1 | MD5 | Blake3 |
|---|---|---|---|---|
| Security | Strong (256-bit) | Weak (broken) | Weak (broken) | Strong |
| Speed | ~400 MB/s | ~600 MB/s | ~800 MB/s | ~1500 MB/s |
| Collision Resistance | 2^128 | Broken | Broken | 2^128 |
| Industry Adoption | Universal | Legacy | Legacy | Emerging |
| Tool Ecosystem | Excellent | Good | Good | Growing |

Rationale

  1. Security: SHA256 has no known practical collision attacks. SHA1 and MD5 are cryptographically broken.

  2. Collision Resistance: A 256-bit output gives roughly 128 bits of collision resistance: you would need approximately 2^128 (~3 x 10^38) unique files to have a 50% chance of any collision (the birthday bound), and the chance that any two specific files collide is about 2^-256.

  3. Industry Standard: SHA256 is the de facto standard for content-addressable storage (Git, Docker, npm, etc.).

  4. Performance: While Blake3 is faster, SHA256 throughput (~400 MB/s) exceeds typical network bandwidth for uploads. The bottleneck is I/O, not hashing.

  5. Tooling: Universal support in all languages, operating systems, and verification tools.

Migration Path

If a future algorithm change is needed (e.g., SHA3 or Blake3):

  1. Database: Add hash_algorithm column to artifacts table (default: 'sha256')
  2. S3 Keys: New algorithm uses different prefix (e.g., fruits-sha3/ vs fruits/)
  3. API: Accept algorithm hint in upload, return algorithm in responses
  4. Migration: Background job to re-hash existing artifacts if needed

Current Implementation: Single algorithm (SHA256), no algorithm versioning required for MVP.


Content-Addressable Storage Model

Core Principles

  1. Identity = Content: The artifact ID IS the SHA256 hash of its content
  2. Immutability: Content cannot change after storage (same hash = same content)
  3. Deduplication: Same content uploaded twice results in single storage
  4. Metadata Independence: Files with identical content but different names/types are deduplicated

Data Model

Artifact {
    id: VARCHAR(64) PRIMARY KEY  -- SHA256 hash (lowercase hex)
    size: BIGINT                 -- File size in bytes
    ref_count: INTEGER           -- Number of references
    s3_key: VARCHAR(1024)        -- S3 storage path
    checksum_md5: VARCHAR(32)    -- Secondary checksum
    checksum_sha1: VARCHAR(40)   -- Secondary checksum
    ...
}

Tag {
    id: UUID PRIMARY KEY
    name: VARCHAR(255)
    package_id: UUID FK
    artifact_id: VARCHAR(64) FK  -- Points to Artifact.id (SHA256)
}

Hash Format

  • Algorithm: SHA256
  • Output: 64 lowercase hexadecimal characters
  • Example: dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f

S3 Key Derivation

Key Structure

fruits/{hash[0:2]}/{hash[2:4]}/{full_hash}

Example for hash dffd6021bb2bd5b0...:

fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
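
A minimal sketch of the derivation (the function name is illustrative):

```python
def derive_s3_key(sha256_hash: str) -> str:
    """Map a 64-character lowercase SHA256 hex digest to its sharded S3 key."""
    h = sha256_hash.lower()
    if len(h) != 64 or any(c not in "0123456789abcdef" for c in h):
        raise ValueError("expected a 64-character SHA256 hex digest")
    return f"fruits/{h[0:2]}/{h[2:4]}/{h}"

key = derive_s3_key(
    "dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
)
# fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
```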

Rationale for Prefix Sharding

  1. S3 Performance: S3 partitions by key prefix. Distributing across prefixes improves throughput.

  2. Filesystem Compatibility: When using filesystem-backed storage, avoids single directory with millions of files.

  3. Distribution: Two levels of 2-character hex prefixes (256 combinations each) yield 65,536 (256 x 256) prefix buckets.

Bucket Distribution Analysis

Assuming uniformly distributed SHA256 hashes:

| Artifacts | Files per Prefix (avg) | Max per Prefix (99.9%) |
|---|---|---|
| 100,000 | 1.5 | 10 |
| 1,000,000 | 15 | 50 |
| 10,000,000 | 152 | 250 |
| 100,000,000 | 1,525 | 2,000 |

The two-level prefix provides excellent distribution up to hundreds of millions of artifacts.


Duplicate Detection Strategy

Upload Flow

┌─────────────────────────────────────────────────────────────────┐
│                        UPLOAD REQUEST                            │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  1. VALIDATE: Check file size limits (min/max)                   │
│     - Empty files (0 bytes) → Reject with 422                    │
│     - Exceeds max_file_size → Reject with 413                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. COMPUTE HASH: Stream file through SHA256/MD5/SHA1            │
│     - Use 8MB chunks for memory efficiency                       │
│     - Single pass for all three hashes                           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. DERIVE S3 KEY: fruits/{hash[0:2]}/{hash[2:4]}/{hash}        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. CHECK EXISTENCE: HEAD request to S3 for derived key          │
│     - Retry up to 3 times on transient failures                  │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────────────┐
│  EXISTS: Deduplicated    │     │  NOT EXISTS: Upload to S3       │
│  - Verify size matches   │     │  - PUT object (or multipart)    │
│  - Skip S3 upload        │     │  - Abort on failure             │
│  - Log saved bytes       │     └─────────────────────────────────┘
└─────────────────────────┘                   │
              │                               │
              └───────────────┬───────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  5. DATABASE: Create/update artifact record                      │
│     - Use row locking to prevent race conditions                 │
│     - ref_count managed by SQL triggers                          │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  6. CREATE TAG: If tag provided, create/update tag               │
│     - SQL trigger increments ref_count                           │
└─────────────────────────────────────────────────────────────────┘

Hash Computation

Memory Requirements:

  • Chunk size: 8MB (HASH_CHUNK_SIZE)
  • Working memory: ~25MB (8MB chunk + hash states)
  • Independent of file size (streaming)

Throughput:

  • SHA256 alone: ~400 MB/s on modern CPU
  • With MD5 + SHA1: ~300 MB/s (all three digests updated in the same pass)
  • Typical bottleneck: Network I/O, not CPU
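
A sketch of the single-pass digest computation described above (the function name is illustrative; HASH_CHUNK_SIZE matches the 8MB value listed under Memory Requirements):

```python
import hashlib

HASH_CHUNK_SIZE = 8 * 1024 * 1024  # 8MB, matching the value above

def compute_digests(fileobj):
    """Stream the file once, updating all three digests per chunk."""
    sha256, md5, sha1 = hashlib.sha256(), hashlib.md5(), hashlib.sha1()
    for chunk in iter(lambda: fileobj.read(HASH_CHUNK_SIZE), b""):
        sha256.update(chunk)
        md5.update(chunk)
        sha1.update(chunk)
    return sha256.hexdigest(), md5.hexdigest(), sha1.hexdigest()
```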

Multipart Upload Threshold

Files larger than 100MB use S3 multipart upload:

  • First pass: Stream to compute hashes
  • If not duplicate: Seek to start, upload in 10MB parts
  • On failure: Abort multipart upload (no orphaned parts)
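
A sketch of the abort-on-failure pattern using boto3's multipart APIs. Names are illustrative, and the caller is assumed to have already rewound the file to the start after the hashing pass:

```python
import boto3

s3 = boto3.client("s3")
PART_SIZE = 10 * 1024 * 1024  # 10MB parts, as described above

def multipart_upload(fileobj, bucket: str, key: str) -> None:
    """Upload in parts; abort on any failure so no orphaned parts remain."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = upload["UploadId"]
    try:
        parts, part_number = [], 1
        while True:
            chunk = fileobj.read(PART_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```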

Reference Counting Lifecycle

What Constitutes a "Reference"

A reference is a Tag pointing to an artifact. Each tag increments the ref_count by 1.

Uploads do NOT directly increment ref_count - only tag creation does.

Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│  CREATE: New artifact uploaded                                   │
│  - ref_count = 0 (no tags yet)                                   │
│  - Artifact exists but is "orphaned"                             │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  TAG CREATED: Tag points to artifact                             │
│  - SQL trigger: ref_count += 1                                   │
│  - Artifact is now referenced                                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  TAG UPDATED: Tag moved to different artifact                    │
│  - SQL trigger on old artifact: ref_count -= 1                   │
│  - SQL trigger on new artifact: ref_count += 1                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  TAG DELETED: Tag removed                                        │
│  - SQL trigger: ref_count -= 1                                   │
│  - If ref_count = 0, artifact is orphaned                        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  GARBAGE COLLECTION: Clean up orphaned artifacts                 │
│  - Triggered manually via admin endpoint                         │
│  - Finds artifacts where ref_count = 0                           │
│  - Deletes from S3 and database                                  │
└─────────────────────────────────────────────────────────────────┘

SQL Triggers

Three triggers manage ref_count automatically:

  1. tags_ref_count_insert_trigger: On tag INSERT, increment target artifact's ref_count
  2. tags_ref_count_delete_trigger: On tag DELETE, decrement target artifact's ref_count
  3. tags_ref_count_update_trigger: On tag UPDATE (artifact_id changed), decrement old, increment new
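
As an illustration only (the real definitions live in the schema migrations), the insert trigger might look like the following PostgreSQL DDL, shown here as a string that a Python migration could execute; the function name is a placeholder:

```python
# Hypothetical sketch only; not the actual schema migration.
TAGS_REF_COUNT_INSERT_TRIGGER = """
CREATE FUNCTION increment_artifact_ref_count() RETURNS trigger AS $$
BEGIN
    UPDATE artifacts SET ref_count = ref_count + 1 WHERE id = NEW.artifact_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tags_ref_count_insert_trigger
    AFTER INSERT ON tags
    FOR EACH ROW EXECUTE FUNCTION increment_artifact_ref_count();
"""
```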

Garbage Collection

Trigger: Manual admin endpoint (POST /api/v1/admin/garbage-collect)

Process:

  1. Query artifacts where ref_count = 0
  2. For each orphan:
    • Delete from S3 (DELETE fruits/xx/yy/hash)
    • Delete from database
    • Log deletion

Safety:

  • Dry-run mode by default (?dry_run=true)
  • Limit per run (?limit=100)
  • Check constraint prevents ref_count < 0
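
For illustration, a sketch of what the collection loop could look like, assuming a psycopg2-style connection and a boto3 client; table and column names follow the data model above:

```python
def garbage_collect(conn, s3, bucket: str, dry_run: bool = True, limit: int = 100):
    """Delete artifacts whose ref_count is 0 from S3 and then the database."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, s3_key FROM artifacts WHERE ref_count = 0 LIMIT %s", (limit,)
        )
        orphans = cur.fetchall()
        deleted = []
        for artifact_id, s3_key in orphans:
            if not dry_run:
                s3.delete_object(Bucket=bucket, Key=s3_key)
                cur.execute(
                    "DELETE FROM artifacts WHERE id = %s AND ref_count = 0",
                    (artifact_id,),
                )
            deleted.append(artifact_id)
    conn.commit()
    return {"dry_run": dry_run, "deleted": deleted}
```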

Edge Cases and Error Handling

Empty Files

  • Behavior: Rejected with HTTP 422
  • Reason: Empty content has deterministic hash but provides no value
  • Error: "Empty files are not allowed"

Maximum File Size

  • Default Limit: 10GB (ORCHARD_MAX_FILE_SIZE)
  • Configurable: Via environment variable
  • Behavior: Rejected with HTTP 413 before upload begins
  • Error: "File too large. Maximum size is 10GB"

Concurrent Upload of Same Content

Race Condition Scenario: Two clients upload identical content simultaneously.

Handling:

  1. S3 Level: Both compute same hash, both check existence, both may upload
  2. Database Level: Row-level locking with SELECT ... FOR UPDATE
  3. Outcome: One creates artifact, other sees it exists, both succeed
  4. Trigger Safety: SQL triggers are atomic per row

No Data Corruption: Concurrent PUTs of the same key carry identical content, so whichever write wins, the stored object is the same.
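
A sketch of the database step under these assumptions (psycopg2-style connection; ON CONFLICT DO NOTHING is shown as one way to make the losing writer a harmless no-op, not as the actual implementation):

```python
def upsert_artifact(conn, artifact_id: str, size: int, s3_key: str) -> None:
    """Create the artifact row if missing; concurrent uploads of the same hash both succeed."""
    with conn.cursor() as cur:
        # Lock the row if it already exists so concurrent writers serialize on it.
        cur.execute(
            "SELECT id FROM artifacts WHERE id = %s FOR UPDATE", (artifact_id,)
        )
        if cur.fetchone() is None:
            cur.execute(
                """
                INSERT INTO artifacts (id, size, ref_count, s3_key)
                VALUES (%s, %s, 0, %s)
                ON CONFLICT (id) DO NOTHING
                """,
                (artifact_id, size, s3_key),
            )
    conn.commit()
```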

Upload Interrupted

Scenario: Upload fails after hash computed but before S3 write completes.

Simple Upload:

  • S3 put_object is atomic - either completes or fails entirely
  • No cleanup needed

Multipart Upload:

  • On any failure, abort_multipart_upload is called
  • S3 cleans up partial parts
  • No orphaned data

DB Exists but S3 Missing

Detection: Download request finds artifact in DB but S3 returns 404.

Current Behavior: Return 500 error to client.

Recovery Options (not yet implemented):

  1. Mark artifact for re-upload (set flag, notify admins)
  2. Decrement ref_count to trigger garbage collection
  3. Return specific error code for client retry

Recommended: Log critical alert, return 503 with retry hint.

S3 Exists but DB Missing

Detection: Orphan - file in S3 with no corresponding DB record.

Cause:

  • Failed transaction after S3 upload
  • Manual S3 manipulation
  • Database restore from backup

Recovery:

  • Garbage collection won't delete (no DB record to query)
  • Requires S3 bucket scan + DB reconciliation
  • Manual admin task (out of scope for MVP)

Network Timeout During Existence Check

Behavior: Retry up to 3 times with adaptive backoff.

After Retries Exhausted: Raise S3ExistenceCheckError, return 503 to client.

Rationale: Don't upload without knowing if duplicate exists (prevents orphans).
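
A sketch of the check, assuming boto3; plain exponential backoff stands in here for the adaptive backoff mentioned above:

```python
import time

from botocore.exceptions import BotoCoreError, ClientError


class S3ExistenceCheckError(Exception):
    """Raised when the duplicate check cannot be completed."""


def artifact_exists(s3, bucket: str, key: str, max_retries: int = 3) -> bool:
    """HEAD the derived key; retry transient failures rather than guessing."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                return False          # definitive: content not stored yet
            last_error = err
        except BotoCoreError as err:  # e.g. connect/read timeouts
            last_error = err
        if attempt < max_retries:
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise S3ExistenceCheckError(f"could not verify existence of {key}") from last_error
```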


Collision Handling

SHA256 Collision Probability

For random inputs, the probability of collision is approximately:

P(collision) ≈ n² / 2^257

where n is the number of unique files.

| Files | Collision Probability |
|---|---|
| 10^9 (1 billion) | 10^-59 |
| 10^12 (1 trillion) | 10^-53 |
| 10^18 | 10^-41 |
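
The table values follow from the approximation; a quick order-of-magnitude check:

```python
# Birthday-bound approximation: P(collision) ~ n^2 / 2^257
for n in (10**9, 10**12, 10**18):
    print(f"{n:.0e} files -> collision probability ~ {n * n / 2**257:.0e}")
# prints roughly 4e-60, 4e-54, 4e-42 (matching the orders of magnitude above)
```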

Practical Assessment: You would need to store more files than atoms in the observable universe to have meaningful collision risk.

Detection Mechanism

Despite near-zero probability, we detect potential collisions by:

  1. Size Comparison: If hash matches but sizes differ, CRITICAL alert
  2. ETag Verification: S3 ETag provides secondary check

Handling Procedure

If collision detected (size mismatch):

  1. Log CRITICAL alert with full details
  2. Reject upload with 500 error
  3. Do NOT overwrite existing content
  4. Notify operations for manual investigation
raise HashCollisionError(
    f"Hash collision detected for {sha256_hash}: size mismatch"
)

MVP Position

For MVP, we:

  • Detect collisions via size mismatch
  • Log and alert on detection
  • Reject conflicting upload
  • Accept that true collisions are practically impossible

No active mitigation (e.g., storing hash + size as composite key) is needed.


Performance Considerations

Hash Computation Overhead

| File Size | Hash Time | Upload Time (100 Mbps) | Overhead |
|---|---|---|---|
| 10 MB | 25ms | 800ms | 3% |
| 100 MB | 250ms | 8s | 3% |
| 1 GB | 2.5s | 80s | 3% |
| 10 GB | 25s | 800s | 3% |

Conclusion: Hash computation adds ~3% overhead regardless of file size. Network I/O dominates.
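
The figures above are just the ratio of hash throughput to network throughput; a quick sketch of the arithmetic, with the assumed constants shown inline:

```python
HASH_MBPS = 400          # assumed SHA256 throughput, MB/s
NETWORK_MBPS = 100 / 8   # 100 Mbps link expressed in MB/s

for size_mb in (10, 100, 1024, 10240):
    hash_s = size_mb / HASH_MBPS
    upload_s = size_mb / NETWORK_MBPS
    print(f"{size_mb} MB: hash {hash_s:.2f}s, upload {upload_s:.0f}s, "
          f"overhead {hash_s / upload_s:.1%}")
```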

Existence Check Overhead

  • S3 HEAD request: ~50-100ms per call
  • Future caching: Could use Redis or an in-memory cache for hot paths
  • Current MVP: No caching (acceptable for expected load)

Deduplication Savings

Example with 50% duplication rate:

| Metric | Without Dedup | With Dedup | Savings |
|---|---|---|---|
| Storage (100K files, 10MB avg) | 1 TB | 500 GB | 50% |
| Upload bandwidth | 1 TB | 500 GB | 50% |
| S3 costs | $23/mo | $11.50/mo | 50% |

Operations Runbook

Monitoring Deduplication

# View deduplication stats
curl http://orchard:8080/api/v1/stats/deduplication

# Response includes:
# - deduplication_ratio
# - total_uploads, deduplicated_uploads
# - bytes_saved

Checking for Orphaned Artifacts

# List orphaned artifacts (ref_count = 0)
curl http://orchard:8080/api/v1/admin/orphaned-artifacts

# Dry-run garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=true"

# Execute garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=false"

Verifying Artifact Integrity

# Download and verify hash matches artifact ID
ARTIFACT_ID="dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
curl -o downloaded_file http://orchard:8080/api/v1/artifact/$ARTIFACT_ID/download
COMPUTED=$(sha256sum downloaded_file | cut -d' ' -f1)
[ "$ARTIFACT_ID" = "$COMPUTED" ] && echo "OK" || echo "INTEGRITY FAILURE"

Troubleshooting

| Symptom | Likely Cause | Resolution |
|---|---|---|
| "Hash computation error" | Empty file or read error | Check file content, retry |
| "Storage unavailable" | S3/MinIO down | Check S3 health, retry |
| "File too large" | Exceeds max_file_size | Adjust config or use chunked upload |
| "Hash collision detected" | Extremely rare | Investigate, do not ignore |
| Orphaned artifacts accumulating | Tags deleted, no GC run | Run garbage collection |
| Download returns 404 | S3 object missing | Check S3 bucket, restore from backup |

Configuration Reference

| Variable | Default | Description |
|---|---|---|
| ORCHARD_MAX_FILE_SIZE | 10GB | Maximum upload size |
| ORCHARD_MIN_FILE_SIZE | 1 | Minimum upload size (rejects empty files) |
| ORCHARD_S3_MAX_RETRIES | 3 | Retry attempts for S3 operations |
| ORCHARD_S3_CONNECT_TIMEOUT | 10s | S3 connection timeout |
| ORCHARD_S3_READ_TIMEOUT | 60s | S3 read timeout |

Appendix: Decision Records

ADR-001: SHA256 for Content Hashing

Status: Accepted

Context: Need deterministic content identifier for deduplication.

Decision: Use SHA256.

Rationale:

  • Cryptographically strong (no known attacks)
  • Universal adoption (Git, Docker, npm)
  • Sufficient speed for I/O-bound workloads
  • Excellent tooling

Consequences:

  • 64-character artifact IDs (longer than UUIDs)
  • CPU overhead ~3% of upload time
  • Future algorithm migration requires versioning

ADR-002: Whole-File Deduplication Only

Status: Accepted

Context: Could implement chunk-level deduplication for better savings.

Decision: Whole-file only for MVP.

Rationale:

  • Simpler implementation
  • No chunking algorithm complexity
  • Sufficient for build artifact use case
  • Can add chunk-level later if needed

Consequences:

  • Files with partial overlap stored entirely
  • Large files with small changes not deduplicated
  • Acceptable for binary artifact workloads

ADR-003: SQL Triggers for ref_count

Status: Accepted

Context: ref_count must be accurate for garbage collection.

Decision: Use PostgreSQL triggers, not application code.

Rationale:

  • Atomic with tag operations
  • Cannot be bypassed
  • Works regardless of client (API, direct SQL, migrations)
  • Simpler application code

Consequences:

  • Trigger logic in SQL (less visible)
  • Must maintain triggers across schema changes
  • Debugging requires database access