Deduplication Design Document

This document defines Orchard's content-addressable storage and deduplication approach using SHA256 hashes.

Table of Contents

  1. Overview
  2. Hash Algorithm Selection
  3. Content-Addressable Storage Model
  4. S3 Key Derivation
  5. Duplicate Detection Strategy
  6. Reference Counting Lifecycle
  7. Edge Cases and Error Handling
  8. Collision Handling
  9. Performance Considerations
  10. Operations Runbook

Overview

Orchard uses whole-file deduplication based on content hashing. When a file is uploaded:

  1. The SHA256 hash of the entire file content is computed
  2. The hash becomes the artifact's primary identifier
  3. If a file with the same hash already exists, no duplicate is stored
  4. Multiple tags/references can point to the same artifact

Scope: Orchard implements whole-file deduplication only. Chunk-level or block-level deduplication is out of scope for MVP.
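
The end-to-end behavior can be summarized in a short sketch. This is illustrative only, assuming a boto3 S3 client and a placeholder bucket name (`orchard-artifacts`); the real handler also performs validation, secondary checksums, and the database updates described in later sections.

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "orchard-artifacts"  # placeholder bucket name

def upload_artifact(path: str) -> str:
    """Return the SHA256 artifact ID, uploading to S3 only if the content is new."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            sha256.update(chunk)
    artifact_id = sha256.hexdigest()
    key = f"fruits/{artifact_id[0:2]}/{artifact_id[2:4]}/{artifact_id}"

    try:
        s3.head_object(Bucket=BUCKET, Key=key)   # already stored: deduplicated
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
        s3.upload_file(path, BUCKET, key)        # new content: upload it
    # Database upsert of the artifact row (and optional tag) happens after this.
    return artifact_id
```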


Hash Algorithm Selection

Decision: SHA256

| Criteria | SHA256 | SHA1 | MD5 | Blake3 |
|---|---|---|---|---|
| Security | Strong (256-bit) | Weak (broken) | Weak (broken) | Strong |
| Speed | ~400 MB/s | ~600 MB/s | ~800 MB/s | ~1500 MB/s |
| Collision Resistance | 2^128 | Broken | Broken | 2^128 |
| Industry Adoption | Universal | Legacy | Legacy | Emerging |
| Tool Ecosystem | Excellent | Good | Good | Growing |

Rationale

  1. Security: SHA256 has no known practical collision attacks. SHA1 and MD5 are cryptographically broken.

  2. Collision Resistance: A 256-bit output gives roughly 128 bits of collision resistance: you would need approximately 2^128 (~3 x 10^38) unique files to have a 50% chance of any collision (the birthday bound), and the chance that any two specific files collide is about 2^-256.

  3. Industry Standard: SHA256 is the de facto standard for content-addressable storage (Git, Docker, npm, etc.).

  4. Performance: While Blake3 is faster, SHA256 throughput (~400 MB/s) exceeds typical network bandwidth for uploads. The bottleneck is I/O, not hashing.

  5. Tooling: Universal support in all languages, operating systems, and verification tools.

Migration Path

If a future algorithm change is needed (e.g., SHA3 or Blake3):

  1. Database: Add hash_algorithm column to artifacts table (default: 'sha256')
  2. S3 Keys: New algorithm uses different prefix (e.g., fruits-sha3/ vs fruits/)
  3. API: Accept algorithm hint in upload, return algorithm in responses
  4. Migration: Background job to re-hash existing artifacts if needed

Current Implementation: Single algorithm (SHA256), no algorithm versioning required for MVP.


Content-Addressable Storage Model

Core Principles

  1. Identity = Content: The artifact ID IS the SHA256 hash of its content
  2. Immutability: Content cannot change after storage (same hash = same content)
  3. Deduplication: Same content uploaded twice results in single storage
  4. Metadata Independence: Files with identical content but different names/types are deduplicated

Data Model

Artifact {
    id: VARCHAR(64) PRIMARY KEY  -- SHA256 hash (lowercase hex)
    size: BIGINT                 -- File size in bytes
    ref_count: INTEGER           -- Number of references
    s3_key: VARCHAR(1024)        -- S3 storage path
    checksum_md5: VARCHAR(32)    -- Secondary checksum
    checksum_sha1: VARCHAR(40)   -- Secondary checksum
    ...
}

Tag {
    id: UUID PRIMARY KEY
    name: VARCHAR(255)
    package_id: UUID FK
    artifact_id: VARCHAR(64) FK  -- Points to Artifact.id (SHA256)
}

Hash Format

  • Algorithm: SHA256
  • Output: 64 lowercase hexadecimal characters
  • Example: dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f

S3 Key Derivation

Key Structure

fruits/{hash[0:2]}/{hash[2:4]}/{full_hash}

Example for hash dffd6021bb2bd5b0...:

fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
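
A minimal sketch of the derivation (the function name is illustrative):

```python
def derive_s3_key(sha256_hash: str) -> str:
    """Map a 64-character lowercase SHA256 hex digest to its sharded S3 key."""
    h = sha256_hash.lower()
    if len(h) != 64 or any(c not in "0123456789abcdef" for c in h):
        raise ValueError("expected a 64-character SHA256 hex digest")
    return f"fruits/{h[0:2]}/{h[2:4]}/{h}"

key = derive_s3_key(
    "dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
)
# fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
```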

Rationale for Prefix Sharding

  1. S3 Performance: S3 partitions by key prefix. Distributing across prefixes improves throughput.

  2. Filesystem Compatibility: When using filesystem-backed storage, avoids single directory with millions of files.

  3. Distribution: Two levels of 2-character hex prefixes (256 combinations each) yield 65,536 (256 x 256) prefix buckets.

Bucket Distribution Analysis

Assuming uniformly distributed SHA256 hashes:

| Artifacts | Files per Prefix (avg) | Max per Prefix (99.9%) |
|---|---|---|
| 100,000 | 1.5 | 10 |
| 1,000,000 | 15 | 50 |
| 10,000,000 | 152 | 250 |
| 100,000,000 | 1,525 | 2,000 |

The two-level prefix provides excellent distribution up to hundreds of millions of artifacts.


Duplicate Detection Strategy

Upload Flow

┌─────────────────────────────────────────────────────────────────┐
│                        UPLOAD REQUEST                            │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  1. VALIDATE: Check file size limits (min/max)                   │
│     - Empty files (0 bytes) → Reject with 422                    │
│     - Exceeds max_file_size → Reject with 413                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  2. COMPUTE HASH: Stream file through SHA256/MD5/SHA1            │
│     - Use 8MB chunks for memory efficiency                       │
│     - Single pass for all three hashes                           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  3. DERIVE S3 KEY: fruits/{hash[0:2]}/{hash[2:4]}/{hash}        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  4. CHECK EXISTENCE: HEAD request to S3 for derived key          │
│     - Retry up to 3 times on transient failures                  │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────────────┐
│  EXISTS: Deduplicated    │     │  NOT EXISTS: Upload to S3       │
│  - Verify size matches   │     │  - PUT object (or multipart)    │
│  - Skip S3 upload        │     │  - Abort on failure             │
│  - Log saved bytes       │     └─────────────────────────────────┘
└─────────────────────────┘                   │
              │                               │
              └───────────────┬───────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  5. DATABASE: Create/update artifact record                      │
│     - Use row locking to prevent race conditions                 │
│     - ref_count managed by SQL triggers                          │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  6. CREATE TAG: If tag provided, create/update tag               │
│     - SQL trigger increments ref_count                           │
└─────────────────────────────────────────────────────────────────┘

Hash Computation

Memory Requirements:

  • Chunk size: 8MB (HASH_CHUNK_SIZE)
  • Working memory: ~25MB (8MB chunk + hash states)
  • Independent of file size (streaming)

Throughput:

  • SHA256 alone: ~400 MB/s on modern CPU
  • With MD5 + SHA1: ~300 MB/s (all three digests updated in the same pass)
  • Typical bottleneck: Network I/O, not CPU
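
A sketch of the single-pass digest computation described above (the function name is illustrative; HASH_CHUNK_SIZE matches the 8MB value listed under Memory Requirements):

```python
import hashlib

HASH_CHUNK_SIZE = 8 * 1024 * 1024  # 8MB, matching the value above

def compute_digests(fileobj):
    """Stream the file once, updating all three digests per chunk."""
    sha256, md5, sha1 = hashlib.sha256(), hashlib.md5(), hashlib.sha1()
    for chunk in iter(lambda: fileobj.read(HASH_CHUNK_SIZE), b""):
        sha256.update(chunk)
        md5.update(chunk)
        sha1.update(chunk)
    return sha256.hexdigest(), md5.hexdigest(), sha1.hexdigest()
```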

Multipart Upload Threshold

Files larger than 100MB use S3 multipart upload:

  • First pass: Stream to compute hashes
  • If not duplicate: Seek to start, upload in 10MB parts
  • On failure: Abort multipart upload (no orphaned parts)
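
A sketch of the abort-on-failure pattern using boto3's multipart APIs. Names are illustrative, and the caller is assumed to have already rewound the file to the start after the hashing pass:

```python
import boto3

s3 = boto3.client("s3")
PART_SIZE = 10 * 1024 * 1024  # 10MB parts, as described above

def multipart_upload(fileobj, bucket: str, key: str) -> None:
    """Upload in parts; abort on any failure so no orphaned parts remain."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = upload["UploadId"]
    try:
        parts, part_number = [], 1
        while True:
            chunk = fileobj.read(PART_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```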

Reference Counting Lifecycle

What Constitutes a "Reference"

A reference is a Tag pointing to an artifact. Each tag increments the ref_count by 1.

Uploads do NOT directly increment ref_count - only tag creation does.

Lifecycle

┌─────────────────────────────────────────────────────────────────┐
│  CREATE: New artifact uploaded                                   │
│  - ref_count = 0 (no tags yet)                                   │
│  - Artifact exists but is "orphaned"                             │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  TAG CREATED: Tag points to artifact                             │
│  - SQL trigger: ref_count += 1                                   │
│  - Artifact is now referenced                                    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  TAG UPDATED: Tag moved to different artifact                    │
│  - SQL trigger on old artifact: ref_count -= 1                   │
│  - SQL trigger on new artifact: ref_count += 1                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  TAG DELETED: Tag removed                                        │
│  - SQL trigger: ref_count -= 1                                   │
│  - If ref_count = 0, artifact is orphaned                        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  GARBAGE COLLECTION: Clean up orphaned artifacts                 │
│  - Triggered manually via admin endpoint                         │
│  - Finds artifacts where ref_count = 0                           │
│  - Deletes from S3 and database                                  │
└─────────────────────────────────────────────────────────────────┘

SQL Triggers

Three triggers manage ref_count automatically:

  1. tags_ref_count_insert_trigger: On tag INSERT, increment target artifact's ref_count
  2. tags_ref_count_delete_trigger: On tag DELETE, decrement target artifact's ref_count
  3. tags_ref_count_update_trigger: On tag UPDATE (artifact_id changed), decrement old, increment new
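
As an illustration only (the real definitions live in the schema migrations), the insert trigger might look like the following PostgreSQL DDL, shown here as a string that a Python migration could execute; the function name is a placeholder:

```python
# Hypothetical sketch only; not the actual schema migration.
TAGS_REF_COUNT_INSERT_TRIGGER = """
CREATE FUNCTION increment_artifact_ref_count() RETURNS trigger AS $$
BEGIN
    UPDATE artifacts SET ref_count = ref_count + 1 WHERE id = NEW.artifact_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tags_ref_count_insert_trigger
    AFTER INSERT ON tags
    FOR EACH ROW EXECUTE FUNCTION increment_artifact_ref_count();
"""
```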

Garbage Collection

Trigger: Manual admin endpoint (POST /api/v1/admin/garbage-collect)

Process:

  1. Query artifacts where ref_count = 0
  2. For each orphan:
    • Delete from S3 (DELETE fruits/xx/yy/hash)
    • Delete from database
    • Log deletion

Safety:

  • Dry-run mode by default (?dry_run=true)
  • Limit per run (?limit=100)
  • Check constraint prevents ref_count < 0
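
For illustration, a sketch of what the collection loop could look like, assuming a psycopg2-style connection and a boto3 client; table and column names follow the data model above:

```python
def garbage_collect(conn, s3, bucket: str, dry_run: bool = True, limit: int = 100):
    """Delete artifacts whose ref_count is 0 from S3 and then the database."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, s3_key FROM artifacts WHERE ref_count = 0 LIMIT %s", (limit,)
        )
        orphans = cur.fetchall()
        deleted = []
        for artifact_id, s3_key in orphans:
            if not dry_run:
                s3.delete_object(Bucket=bucket, Key=s3_key)
                cur.execute(
                    "DELETE FROM artifacts WHERE id = %s AND ref_count = 0",
                    (artifact_id,),
                )
            deleted.append(artifact_id)
    conn.commit()
    return {"dry_run": dry_run, "deleted": deleted}
```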

Edge Cases and Error Handling

Empty Files

  • Behavior: Rejected with HTTP 422
  • Reason: Empty content has deterministic hash but provides no value
  • Error: "Empty files are not allowed"

Maximum File Size

  • Default Limit: 10GB (ORCHARD_MAX_FILE_SIZE)
  • Configurable: Via environment variable
  • Behavior: Rejected with HTTP 413 before upload begins
  • Error: "File too large. Maximum size is 10GB"

Concurrent Upload of Same Content

Race Condition Scenario: Two clients upload identical content simultaneously.

Handling:

  1. S3 Level: Both compute same hash, both check existence, both may upload
  2. Database Level: Row-level locking with SELECT ... FOR UPDATE
  3. Outcome: One creates artifact, other sees it exists, both succeed
  4. Trigger Safety: SQL triggers are atomic per row

No Data Corruption: Concurrent PUTs of the same key carry identical content, so whichever write wins, the stored object is the same.
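
A sketch of the database step under these assumptions (psycopg2-style connection; ON CONFLICT DO NOTHING is shown as one way to make the losing writer a harmless no-op, not as the actual implementation):

```python
def upsert_artifact(conn, artifact_id: str, size: int, s3_key: str) -> None:
    """Create the artifact row if missing; concurrent uploads of the same hash both succeed."""
    with conn.cursor() as cur:
        # Lock the row if it already exists so concurrent writers serialize on it.
        cur.execute(
            "SELECT id FROM artifacts WHERE id = %s FOR UPDATE", (artifact_id,)
        )
        if cur.fetchone() is None:
            cur.execute(
                """
                INSERT INTO artifacts (id, size, ref_count, s3_key)
                VALUES (%s, %s, 0, %s)
                ON CONFLICT (id) DO NOTHING
                """,
                (artifact_id, size, s3_key),
            )
    conn.commit()
```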

Upload Interrupted

Scenario: Upload fails after hash computed but before S3 write completes.

Simple Upload:

  • S3 put_object is atomic - either completes or fails entirely
  • No cleanup needed

Multipart Upload:

  • On any failure, abort_multipart_upload is called
  • S3 cleans up partial parts
  • No orphaned data

DB Exists but S3 Missing

Detection: Download request finds artifact in DB but S3 returns 404.

Current Behavior: Return 500 error to client.

Recovery Options (not yet implemented):

  1. Mark artifact for re-upload (set flag, notify admins)
  2. Decrement ref_count to trigger garbage collection
  3. Return specific error code for client retry

Recommended: Log critical alert, return 503 with retry hint.

S3 Exists but DB Missing

Detection: Orphan - file in S3 with no corresponding DB record.

Cause:

  • Failed transaction after S3 upload
  • Manual S3 manipulation
  • Database restore from backup

Recovery:

  • Garbage collection won't delete (no DB record to query)
  • Requires S3 bucket scan + DB reconciliation
  • Manual admin task (out of scope for MVP)

Network Timeout During Existence Check

Behavior: Retry up to 3 times with adaptive backoff.

After Retries Exhausted: Raise S3ExistenceCheckError, return 503 to client.

Rationale: Don't upload without knowing if duplicate exists (prevents orphans).
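
A sketch of the check, assuming boto3; plain exponential backoff stands in here for the adaptive backoff mentioned above:

```python
import time

from botocore.exceptions import BotoCoreError, ClientError


class S3ExistenceCheckError(Exception):
    """Raised when the duplicate check cannot be completed."""


def artifact_exists(s3, bucket: str, key: str, max_retries: int = 3) -> bool:
    """HEAD the derived key; retry transient failures rather than guessing."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                return False          # definitive: content not stored yet
            last_error = err
        except BotoCoreError as err:  # e.g. connect/read timeouts
            last_error = err
        if attempt < max_retries:
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise S3ExistenceCheckError(f"could not verify existence of {key}") from last_error
```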


Collision Handling

SHA256 Collision Probability

For random inputs, the probability of collision is approximately:

P(collision) ≈ n² / 2^257

where n is the number of unique files.

| Files | Collision Probability |
|---|---|
| 10^9 (1 billion) | 10^-59 |
| 10^12 (1 trillion) | 10^-53 |
| 10^18 | 10^-41 |
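
The table values follow from the approximation; a quick order-of-magnitude check:

```python
# Birthday-bound approximation: P(collision) ~ n^2 / 2^257
for n in (10**9, 10**12, 10**18):
    print(f"{n:.0e} files -> collision probability ~ {n * n / 2**257:.0e}")
# prints roughly 4e-60, 4e-54, 4e-42 (matching the orders of magnitude above)
```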

Practical Assessment: You would need to store more files than atoms in the observable universe to have meaningful collision risk.

Detection Mechanism

Despite near-zero probability, we detect potential collisions by:

  1. Size Comparison: If hash matches but sizes differ, CRITICAL alert
  2. ETag Verification: S3 ETag provides secondary check

Handling Procedure

If collision detected (size mismatch):

  1. Log CRITICAL alert with full details
  2. Reject upload with 500 error
  3. Do NOT overwrite existing content
  4. Notify operations for manual investigation
raise HashCollisionError(
    f"Hash collision detected for {sha256_hash}: size mismatch"
)

MVP Position

For MVP, we:

  • Detect collisions via size mismatch
  • Log and alert on detection
  • Reject conflicting upload
  • Accept that true collisions are practically impossible

No active mitigation (e.g., storing hash + size as composite key) is needed.


Performance Considerations

Hash Computation Overhead

| File Size | Hash Time | Upload Time (100 Mbps) | Overhead |
|---|---|---|---|
| 10 MB | 25ms | 800ms | 3% |
| 100 MB | 250ms | 8s | 3% |
| 1 GB | 2.5s | 80s | 3% |
| 10 GB | 25s | 800s | 3% |

Conclusion: Hash computation adds ~3% overhead regardless of file size. Network I/O dominates.
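
The figures above are just the ratio of hash throughput to network throughput; a quick sketch of the arithmetic, with the assumed constants shown inline:

```python
HASH_MBPS = 400          # assumed SHA256 throughput, MB/s
NETWORK_MBPS = 100 / 8   # 100 Mbps link expressed in MB/s

for size_mb in (10, 100, 1024, 10240):
    hash_s = size_mb / HASH_MBPS
    upload_s = size_mb / NETWORK_MBPS
    print(f"{size_mb} MB: hash {hash_s:.2f}s, upload {upload_s:.0f}s, "
          f"overhead {hash_s / upload_s:.1%}")
```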

Existence Check Overhead

  • S3 HEAD request: ~50-100ms per call
  • Future caching: Could use Redis or an in-memory cache for hot paths
  • Current MVP: No caching (acceptable for expected load)

Deduplication Savings

Example with 50% duplication rate:

| Metric | Without Dedup | With Dedup | Savings |
|---|---|---|---|
| Storage (100K files, 10MB avg) | 1 TB | 500 GB | 50% |
| Upload bandwidth | 1 TB | 500 GB | 50% |
| S3 costs | $23/mo | $11.50/mo | 50% |

Operations Runbook

Monitoring Deduplication

# View deduplication stats
curl http://orchard:8080/api/v1/stats/deduplication

# Response includes:
# - deduplication_ratio
# - total_uploads, deduplicated_uploads
# - bytes_saved

Checking for Orphaned Artifacts

# List orphaned artifacts (ref_count = 0)
curl http://orchard:8080/api/v1/admin/orphaned-artifacts

# Dry-run garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=true"

# Execute garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=false"

Verifying Artifact Integrity

# Download and verify hash matches artifact ID
ARTIFACT_ID="dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
curl -o downloaded_file http://orchard:8080/api/v1/artifact/$ARTIFACT_ID/download
COMPUTED=$(sha256sum downloaded_file | cut -d' ' -f1)
[ "$ARTIFACT_ID" = "$COMPUTED" ] && echo "OK" || echo "INTEGRITY FAILURE"

Troubleshooting

| Symptom | Likely Cause | Resolution |
|---|---|---|
| "Hash computation error" | Empty file or read error | Check file content, retry |
| "Storage unavailable" | S3/MinIO down | Check S3 health, retry |
| "File too large" | Exceeds max_file_size | Adjust config or use chunked upload |
| "Hash collision detected" | Extremely rare | Investigate, do not ignore |
| Orphaned artifacts accumulating | Tags deleted, no GC run | Run garbage collection |
| Download returns 404 | S3 object missing | Check S3 bucket, restore from backup |

Configuration Reference

| Variable | Default | Description |
|---|---|---|
| ORCHARD_MAX_FILE_SIZE | 10GB | Maximum upload size |
| ORCHARD_MIN_FILE_SIZE | 1 | Minimum upload size (rejects empty files) |
| ORCHARD_S3_MAX_RETRIES | 3 | Retry attempts for S3 operations |
| ORCHARD_S3_CONNECT_TIMEOUT | 10s | S3 connection timeout |
| ORCHARD_S3_READ_TIMEOUT | 60s | S3 read timeout |

Appendix: Decision Records

ADR-001: SHA256 for Content Hashing

Status: Accepted

Context: Need deterministic content identifier for deduplication.

Decision: Use SHA256.

Rationale:

  • Cryptographically strong (no known attacks)
  • Universal adoption (Git, Docker, npm)
  • Sufficient speed for I/O-bound workloads
  • Excellent tooling

Consequences:

  • 64-character artifact IDs (longer than UUIDs)
  • CPU overhead ~3% of upload time
  • Future algorithm migration requires versioning

ADR-002: Whole-File Deduplication Only

Status: Accepted

Context: Could implement chunk-level deduplication for better savings.

Decision: Whole-file only for MVP.

Rationale:

  • Simpler implementation
  • No chunking algorithm complexity
  • Sufficient for build artifact use case
  • Can add chunk-level later if needed

Consequences:

  • Files with partial overlap stored entirely
  • Large files with small changes not deduplicated
  • Acceptable for binary artifact workloads

ADR-003: SQL Triggers for ref_count

Status: Accepted

Context: ref_count must be accurate for garbage collection.

Decision: Use PostgreSQL triggers, not application code.

Rationale:

  • Atomic with tag operations
  • Cannot be bypassed
  • Works regardless of client (API, direct SQL, migrations)
  • Simpler application code

Consequences:

  • Trigger logic in SQL (less visible)
  • Must maintain triggers across schema changes
  • Debugging requires database access