# Deduplication Design Document

This document defines Orchard's content-addressable storage and deduplication approach using SHA256 hashes.

## Table of Contents

1. [Overview](#overview)
2. [Hash Algorithm Selection](#hash-algorithm-selection)
3. [Content-Addressable Storage Model](#content-addressable-storage-model)
4. [S3 Key Derivation](#s3-key-derivation)
5. [Duplicate Detection Strategy](#duplicate-detection-strategy)
6. [Reference Counting Lifecycle](#reference-counting-lifecycle)
7. [Edge Cases and Error Handling](#edge-cases-and-error-handling)
8. [Collision Handling](#collision-handling)
9. [Performance Considerations](#performance-considerations)
10. [Operations Runbook](#operations-runbook)

---
## Overview

Orchard uses **whole-file deduplication** based on content hashing. When a file is uploaded:

1. The SHA256 hash of the entire file content is computed
2. The hash becomes the artifact's primary identifier
3. If a file with the same hash already exists, no duplicate is stored
4. Multiple tags/references can point to the same artifact

**Scope:** Orchard implements whole-file deduplication only. Chunk-level or block-level deduplication is out of scope for MVP.

---
## Hash Algorithm Selection

### Decision: SHA256

| Criteria | SHA256 | SHA1 | MD5 | Blake3 |
|----------|--------|------|-----|--------|
| Security | Strong (256-bit) | Weak (broken) | Weak (broken) | Strong |
| Speed | ~400 MB/s | ~600 MB/s | ~800 MB/s | ~1500 MB/s |
| Collision Resistance | 2^128 | Broken | Broken | 2^128 |
| Industry Adoption | Universal | Legacy | Legacy | Emerging |
| Tool Ecosystem | Excellent | Good | Good | Growing |
### Rationale

1. **Security**: SHA256 has no known practical collision attacks. SHA1 and MD5 are cryptographically broken.

2. **Collision Resistance**: With a 256-bit output, collision resistance is on the order of 2^128 operations; you would need approximately 2^128 (~3.4 x 10^38) unique files before the probability of any accidental collision reaches 50%.

3. **Industry Standard**: SHA256 is the de facto standard for content-addressable storage (Git, Docker, npm, etc.).

4. **Performance**: While Blake3 is faster, SHA256 throughput (~400 MB/s) exceeds typical network bandwidth for uploads. The bottleneck is I/O, not hashing.

5. **Tooling**: Universal support across languages, operating systems, and verification tools.
### Migration Path

If a future algorithm change is needed (e.g., SHA3 or Blake3):

1. **Database**: Add `hash_algorithm` column to artifacts table (default: 'sha256')
2. **S3 Keys**: New algorithm uses different prefix (e.g., `fruits-sha3/` vs `fruits/`)
3. **API**: Accept algorithm hint in upload, return algorithm in responses
4. **Migration**: Background job to re-hash existing artifacts if needed

**Current Implementation**: Single algorithm (SHA256), no algorithm versioning required for MVP.

---
## Content-Addressable Storage Model

### Core Principles

1. **Identity = Content**: The artifact ID IS the SHA256 hash of its content
2. **Immutability**: Content cannot change after storage (same hash = same content)
3. **Deduplication**: The same content uploaded twice results in a single stored copy
4. **Metadata Independence**: Files with identical content but different names/types are deduplicated
### Data Model

```
Artifact {
    id: VARCHAR(64) PRIMARY KEY     -- SHA256 hash (lowercase hex)
    size: BIGINT                    -- File size in bytes
    ref_count: INTEGER              -- Number of references
    s3_key: VARCHAR(1024)           -- S3 storage path
    checksum_md5: VARCHAR(32)       -- Secondary checksum
    checksum_sha1: VARCHAR(40)      -- Secondary checksum
    ...
}

Tag {
    id: UUID PRIMARY KEY
    name: VARCHAR(255)
    package_id: UUID FK
    artifact_id: VARCHAR(64) FK     -- Points to Artifact.id (SHA256)
}
```
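
For illustration, the same model expressed as Python dataclasses; field names follow the schema above, but the actual persistence layer is not specified in this document:

```python
from dataclasses import dataclass
from uuid import UUID


@dataclass
class Artifact:
    """Content-addressed artifact row; `id` is the SHA256 hex digest."""
    id: str            # 64 lowercase hex characters
    size: int          # file size in bytes
    ref_count: int     # number of tags pointing at this artifact
    s3_key: str        # derived storage path (see S3 Key Derivation)
    checksum_md5: str  # secondary checksum
    checksum_sha1: str # secondary checksum


@dataclass
class Tag:
    """Named pointer from a package to an artifact."""
    id: UUID
    name: str
    package_id: UUID
    artifact_id: str   # Artifact.id (SHA256)
```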
### Hash Format

- Algorithm: SHA256
- Output: 64 lowercase hexadecimal characters
- Example: `dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f`

---
## S3 Key Derivation

### Key Structure

```
fruits/{hash[0:2]}/{hash[2:4]}/{full_hash}
```

Example for hash `dffd6021bb2bd5b0...`:

```
fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f
```
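
A minimal sketch of the derivation, assuming the digest has already been computed; the function and constant names are illustrative:

```python
FRUITS_PREFIX = "fruits"  # bucket-internal prefix used in the examples above


def derive_s3_key(sha256_hex: str) -> str:
    """Return fruits/{hash[0:2]}/{hash[2:4]}/{full_hash} for a validated digest."""
    if len(sha256_hex) != 64 or set(sha256_hex) - set("0123456789abcdef"):
        raise ValueError("artifact ID must be a 64-char lowercase SHA256 hex digest")
    return f"{FRUITS_PREFIX}/{sha256_hex[0:2]}/{sha256_hex[2:4]}/{sha256_hex}"


# derive_s3_key("dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f")
# -> "fruits/df/fd/dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
```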
### Rationale for Prefix Sharding

1. **S3 Performance**: S3 partitions by key prefix. Distributing objects across prefixes improves throughput.

2. **Filesystem Compatibility**: When using filesystem-backed storage, sharding avoids a single directory with millions of files.

3. **Distribution**: Two levels of 2-character hex prefixes (256 combinations per level) yield 65,536 (256 x 256) distinct prefixes.

### Bucket Distribution Analysis

Assuming uniformly distributed SHA256 hashes:

| Artifacts | Files per Prefix (avg) | Max per Prefix (99.9%) |
|-----------|------------------------|------------------------|
| 100,000 | 1.5 | 10 |
| 1,000,000 | 15 | 50 |
| 10,000,000 | 152 | 250 |
| 100,000,000 | 1,525 | 2,000 |

The two-level prefix provides excellent distribution up to hundreds of millions of artifacts.
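
The average column is simple division over the 65,536 prefixes; an illustrative check:

```python
# Expected artifacts per two-level prefix, assuming uniform SHA256 output.
PREFIXES = 256 * 256  # 65,536 two-level prefixes

for total in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{total:>11,} artifacts -> {total / PREFIXES:8.1f} per prefix on average")
# 100,000 -> ~1.5; 1,000,000 -> ~15.3; 10,000,000 -> ~152.6; 100,000,000 -> ~1,525.9
```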
---

## Duplicate Detection Strategy

### Upload Flow

```
┌──────────────────────────────────────────────────────────────────┐
│  UPLOAD REQUEST                                                   │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  1. VALIDATE: Check file size limits (min/max)                    │
│     - Empty files (0 bytes) → Reject with 422                     │
│     - Exceeds max_file_size → Reject with 413                     │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  2. COMPUTE HASH: Stream file through SHA256/MD5/SHA1             │
│     - Use 8MB chunks for memory efficiency                        │
│     - Single pass for all three hashes                            │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  3. DERIVE S3 KEY: fruits/{hash[0:2]}/{hash[2:4]}/{hash}          │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  4. CHECK EXISTENCE: HEAD request to S3 for derived key           │
│     - Retry up to 3 times on transient failures                   │
└──────────────────────────────────────────────────────────────────┘
                                │
                ┌───────────────┴───────────────┐
                ▼                               ▼
   ┌─────────────────────────┐ ┌─────────────────────────────────┐
   │  EXISTS: Deduplicated   │ │  NOT EXISTS: Upload to S3       │
   │  - Verify size matches  │ │  - PUT object (or multipart)    │
   │  - Skip S3 upload       │ │  - Abort on failure             │
   │  - Log saved bytes      │ └─────────────────────────────────┘
   └─────────────────────────┘                  │
                │                               │
                └───────────────┬───────────────┘
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  5. DATABASE: Create/update artifact record                       │
│     - Use row locking to prevent race conditions                  │
│     - ref_count managed by SQL triggers                           │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  6. CREATE TAG: If tag provided, create/update tag                │
│     - SQL trigger increments ref_count                            │
└──────────────────────────────────────────────────────────────────┘
```
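
Step 4 of the flow can be sketched with boto3's `head_object`; the retry loop below is illustrative, and the adaptive backoff used in production is omitted for brevity:

```python
import boto3
import botocore.exceptions

s3 = boto3.client("s3")


def object_exists(bucket: str, key: str, max_retries: int = 3) -> bool:
    """HEAD the derived key; True if the content is already stored."""
    for attempt in range(1, max_retries + 1):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except botocore.exceptions.ClientError as exc:
            # 404 means "not found" -> safe to upload; anything else is transient or fatal.
            if exc.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
                return False
            if attempt == max_retries:
                raise  # surfaced to the caller as a 503 (see Edge Cases below)
        except botocore.exceptions.EndpointConnectionError:
            if attempt == max_retries:
                raise
    return False
```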
### Hash Computation

**Memory Requirements:**
- Chunk size: 8MB (`HASH_CHUNK_SIZE`)
- Working memory: ~25MB (8MB chunk + hash states)
- Independent of file size (streaming)

**Throughput:**
- SHA256 alone: ~400 MB/s on modern CPU
- With MD5 + SHA1: ~300 MB/s (parallel computation)
- Typical bottleneck: Network I/O, not CPU
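
A minimal sketch of the single-pass, chunked hashing described above; the function name is illustrative, and the production code may update the three hashes in parallel rather than sequentially:

```python
import hashlib

HASH_CHUNK_SIZE = 8 * 1024 * 1024  # 8MB chunks, as described above


def compute_checksums(fileobj):
    """Stream a file once, feeding SHA256/MD5/SHA1 in 8MB chunks."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    while True:
        chunk = fileobj.read(HASH_CHUNK_SIZE)
        if not chunk:
            break
        size += len(chunk)
        sha256.update(chunk)
        md5.update(chunk)
        sha1.update(chunk)
    return {
        "sha256": sha256.hexdigest(),  # becomes the artifact ID
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "size": size,
    }
```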
### Multipart Upload Threshold

Files larger than 100MB use S3 multipart upload (sketched below):
- First pass: Stream the file to compute hashes
- If not a duplicate: Seek back to the start, upload in 10MB parts
- On failure: Abort the multipart upload (no orphaned parts)
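
Assuming boto3 and a seekable source that has been rewound after the hashing pass, the abort-on-failure behavior might look like this; the names and part bookkeeping are illustrative:

```python
import boto3

MULTIPART_THRESHOLD = 100 * 1024 * 1024  # 100MB
PART_SIZE = 10 * 1024 * 1024             # 10MB parts

s3 = boto3.client("s3")


def multipart_upload(fileobj, bucket: str, key: str) -> None:
    """Upload in 10MB parts, aborting on any failure so no parts are orphaned."""
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu["UploadId"]
    try:
        parts = []
        part_number = 1
        while True:
            data = fileobj.read(PART_SIZE)
            if not data:
                break
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=data,
            )
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Any failure: abort so S3 discards the partial parts.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```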
---

## Reference Counting Lifecycle

### What Constitutes a "Reference"

A reference is a **Tag** pointing to an artifact. Each tag increments the ref_count by 1.

**Uploads do NOT directly increment ref_count** - only tag creation does.

### Lifecycle

```
┌──────────────────────────────────────────────────────────────────┐
│  CREATE: New artifact uploaded                                    │
│     - ref_count = 0 (no tags yet)                                 │
│     - Artifact exists but is "orphaned"                           │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  TAG CREATED: Tag points to artifact                              │
│     - SQL trigger: ref_count += 1                                 │
│     - Artifact is now referenced                                  │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  TAG UPDATED: Tag moved to different artifact                     │
│     - SQL trigger on old artifact: ref_count -= 1                 │
│     - SQL trigger on new artifact: ref_count += 1                 │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  TAG DELETED: Tag removed                                         │
│     - SQL trigger: ref_count -= 1                                 │
│     - If ref_count = 0, artifact is orphaned                      │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  GARBAGE COLLECTION: Clean up orphaned artifacts                  │
│     - Triggered manually via admin endpoint                       │
│     - Finds artifacts where ref_count = 0                         │
│     - Deletes from S3 and database                                │
└──────────────────────────────────────────────────────────────────┘
```
### SQL Triggers

Three triggers manage ref_count automatically:

1. **`tags_ref_count_insert_trigger`**: On tag INSERT, increment target artifact's ref_count
2. **`tags_ref_count_delete_trigger`**: On tag DELETE, decrement target artifact's ref_count
3. **`tags_ref_count_update_trigger`**: On tag UPDATE (artifact_id changed), decrement old, increment new

### Garbage Collection

**Trigger**: Manual admin endpoint (`POST /api/v1/admin/garbage-collect`)

**Process**:
1. Query artifacts where `ref_count = 0`
2. For each orphan:
   - Delete from S3 (`DELETE fruits/xx/yy/hash`)
   - Delete from database
   - Log deletion

**Safety**:
- Dry-run mode by default (`?dry_run=true`)
- Limit per run (`?limit=100`)
- Check constraint prevents ref_count < 0
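
A sketch of the collection loop under the safety controls above; the `db` helper API shown here is assumed for illustration and is not Orchard's actual data-access layer:

```python
def garbage_collect(db, s3, bucket: str, dry_run: bool = True, limit: int = 100):
    """Delete up to `limit` orphaned artifacts (ref_count = 0) from S3 and the DB."""
    orphans = db.query(
        "SELECT id, s3_key, size FROM artifacts WHERE ref_count = 0 LIMIT %s",
        (limit,),
    )
    deleted, bytes_freed = 0, 0
    for row in orphans:
        if dry_run:
            continue  # report candidates only; delete nothing
        s3.delete_object(Bucket=bucket, Key=row["s3_key"])          # remove the blob
        db.execute("DELETE FROM artifacts WHERE id = %s", (row["id"],))
        deleted += 1
        bytes_freed += row["size"]
    return {"candidates": len(orphans), "deleted": deleted, "bytes_freed": bytes_freed}
```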
---

## Edge Cases and Error Handling

### Empty Files

- **Behavior**: Rejected with HTTP 422
- **Reason**: Empty content has a deterministic hash but provides no value
- **Error**: "Empty files are not allowed"

### Maximum File Size

- **Default Limit**: 10GB (`ORCHARD_MAX_FILE_SIZE`)
- **Configurable**: Via environment variable
- **Behavior**: Rejected with HTTP 413 before upload begins
- **Error**: "File too large. Maximum size is 10GB"
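
The size checks map directly onto the two rejection codes; this is an illustrative sketch, and the exception classes and constants are stand-ins rather than Orchard's actual names:

```python
MAX_FILE_SIZE = 10 * 1024**3  # 10GB default (ORCHARD_MAX_FILE_SIZE)
MIN_FILE_SIZE = 1             # rejects empty files (ORCHARD_MIN_FILE_SIZE)


class EmptyFileError(ValueError):     # mapped to HTTP 422 (illustrative name)
    pass


class FileTooLargeError(ValueError):  # mapped to HTTP 413 (illustrative name)
    pass


def validate_size(declared_size: int) -> None:
    """Reject out-of-range uploads before any bytes are streamed to S3."""
    if declared_size < MIN_FILE_SIZE:
        raise EmptyFileError("Empty files are not allowed")
    if declared_size > MAX_FILE_SIZE:
        raise FileTooLargeError("File too large. Maximum size is 10GB")
```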
### Concurrent Upload of Same Content

**Race Condition Scenario**: Two clients upload identical content simultaneously.

**Handling**:
1. **S3 Level**: Both compute the same hash, both check existence, both may upload
2. **Database Level**: Row-level locking with `SELECT ... FOR UPDATE` (see the sketch below)
3. **Outcome**: One creates the artifact, the other sees it exists, both succeed
4. **Trigger Safety**: SQL triggers are atomic per row

**No Data Corruption**: Even if both clients PUT the same key, the content is byte-identical, so the stored object is the same regardless of which write lands last.
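
A sketch of the row-locking step, assuming a PostgreSQL DB-API cursor (e.g., psycopg2); the function name and exact SQL are illustrative:

```python
def upsert_artifact(cur, sha256_hex: str, size: int, s3_key: str) -> None:
    """Create the artifact row if missing; lock it if it already exists."""
    # Lock the existing row (if any) so concurrent uploads of the same hash serialize here.
    cur.execute("SELECT id FROM artifacts WHERE id = %s FOR UPDATE", (sha256_hex,))
    if cur.fetchone() is None:
        # ON CONFLICT handles the race where another transaction inserted the row first;
        # the second uploader simply reuses the existing artifact.
        cur.execute(
            "INSERT INTO artifacts (id, size, ref_count, s3_key) "
            "VALUES (%s, %s, 0, %s) ON CONFLICT (id) DO NOTHING",
            (sha256_hex, size, s3_key),
        )
```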
### Upload Interrupted

**Scenario**: Upload fails after the hash is computed but before the S3 write completes.

**Simple Upload**:
- S3 put_object is atomic - it either completes or fails entirely
- No cleanup needed

**Multipart Upload**:
- On any failure, `abort_multipart_upload` is called
- S3 cleans up partial parts
- No orphaned data

### DB Exists but S3 Missing

**Detection**: Download request finds the artifact in the DB but S3 returns 404.

**Current Behavior**: Return 500 error to client.

**Recovery Options** (not yet implemented):
1. Mark artifact for re-upload (set flag, notify admins)
2. Decrement ref_count to trigger garbage collection
3. Return specific error code for client retry

**Recommended**: Log critical alert, return 503 with retry hint.

### S3 Exists but DB Missing

**Detection**: Orphan - file in S3 with no corresponding DB record.

**Cause**:
- Failed transaction after S3 upload
- Manual S3 manipulation
- Database restore from backup

**Recovery**:
- Garbage collection won't delete it (no DB record to query)
- Requires S3 bucket scan + DB reconciliation
- Manual admin task (out of scope for MVP)

### Network Timeout During Existence Check

**Behavior**: Retry up to 3 times with adaptive backoff.

**After Retries Exhausted**: Raise `S3ExistenceCheckError`, return 503 to client.

**Rationale**: Don't upload without knowing whether a duplicate exists (prevents orphans).

---
## Collision Handling

### SHA256 Collision Probability

For random inputs, the probability of collision is:

```
P(collision) ≈ n² / 2^257

Where n = number of unique files
```

| Files | Collision Probability |
|-------|----------------------|
| 10^9 (1 billion) | 10^-59 |
| 10^12 (1 trillion) | 10^-53 |
| 10^18 | 10^-41 |
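
The table values follow directly from the approximation; a quick check of the arithmetic:

```python
# Birthday-bound approximation from above: P(collision) ~= n^2 / 2^257.
for n in (10**9, 10**12, 10**18):
    p = n**2 / 2**257
    print(f"n = 10^{len(str(n)) - 1:<3} P(collision) ~= {p:.0e}")
# ~4e-60, ~4e-54, ~4e-42 -- matching the orders of magnitude in the table
```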
**Practical Assessment**: You would need on the order of 2^128 unique files before a collision becomes likely - far more than any realistic deployment will ever store.

### Detection Mechanism

Despite the near-zero probability, we detect potential collisions by:

1. **Size Comparison**: If the hash matches but sizes differ, raise a CRITICAL alert
2. **ETag Verification**: S3 ETag provides a secondary check
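
A sketch of the size comparison performed on the deduplicated path; the helper and the local `HashCollisionError` definition are illustrative stand-ins for the real ones:

```python
class HashCollisionError(Exception):
    """Same SHA256 but a different size: treat as a (theoretical) collision."""


def verify_existing_artifact(sha256_hash: str, existing_size: int, incoming_size: int) -> None:
    """Called on the EXISTS branch of the upload flow before skipping the S3 upload."""
    if existing_size != incoming_size:
        # Practically impossible; if it ever fires, alert and refuse the upload.
        raise HashCollisionError(
            f"Hash collision detected for {sha256_hash}: size mismatch"
        )
```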
### Handling Procedure

If a collision is detected (size mismatch):

1. **Log CRITICAL alert** with full details
2. **Reject upload** with 500 error
3. **Do NOT overwrite** existing content
4. **Notify operations** for manual investigation

```python
raise HashCollisionError(
    f"Hash collision detected for {sha256_hash}: size mismatch"
)
```

### MVP Position

For MVP, we:
- Detect collisions via size mismatch
- Log and alert on detection
- Reject the conflicting upload
- Accept that true collisions are practically impossible

No active mitigation (e.g., storing hash + size as a composite key) is needed.

---
## Performance Considerations

### Hash Computation Overhead

| File Size | Hash Time | Upload Time (100 Mbps) | Overhead |
|-----------|-----------|------------------------|----------|
| 10 MB | 25ms | 800ms | 3% |
| 100 MB | 250ms | 8s | 3% |
| 1 GB | 2.5s | 80s | 3% |
| 10 GB | 25s | 800s | 3% |

**Conclusion**: Hash computation adds ~3% overhead regardless of file size. Network I/O dominates.

### Existence Check Overhead

- S3 HEAD request: ~50-100ms per call
- Caching in future: could use Redis or an in-memory cache for hot paths
- Current MVP: no caching (acceptable for expected load)

### Deduplication Savings

Example with a 50% duplication rate:

| Metric | Without Dedup | With Dedup | Savings |
|--------|---------------|------------|---------|
| Storage (100K files, 10MB avg) | 1 TB | 500 GB | 50% |
| Upload bandwidth | 1 TB | 500 GB | 50% |
| S3 costs | $23/mo | $11.50/mo | 50% |

---
## Operations Runbook

### Monitoring Deduplication

```bash
# View deduplication stats
curl http://orchard:8080/api/v1/stats/deduplication

# Response includes:
# - deduplication_ratio
# - total_uploads, deduplicated_uploads
# - bytes_saved
```

### Checking for Orphaned Artifacts

```bash
# List orphaned artifacts (ref_count = 0)
curl http://orchard:8080/api/v1/admin/orphaned-artifacts

# Dry-run garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=true"

# Execute garbage collection
curl -X POST "http://orchard:8080/api/v1/admin/garbage-collect?dry_run=false"
```

### Verifying Artifact Integrity

```bash
# Download and verify the hash matches the artifact ID
ARTIFACT_ID="dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f"
curl -o downloaded_file http://orchard:8080/api/v1/artifact/$ARTIFACT_ID/download
COMPUTED=$(sha256sum downloaded_file | cut -d' ' -f1)
[ "$ARTIFACT_ID" = "$COMPUTED" ] && echo "OK" || echo "INTEGRITY FAILURE"
```

### Troubleshooting

| Symptom | Likely Cause | Resolution |
|---------|--------------|------------|
| "Hash computation error" | Empty file or read error | Check file content, retry |
| "Storage unavailable" | S3/MinIO down | Check S3 health, retry |
| "File too large" | Exceeds max_file_size | Adjust config or use chunked upload |
| "Hash collision detected" | Extremely rare | Investigate, do not ignore |
| Orphaned artifacts accumulating | Tags deleted, no GC run | Run garbage collection |
| Download returns 404 | S3 object missing | Check S3 bucket, restore from backup |

### Configuration Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `ORCHARD_MAX_FILE_SIZE` | 10GB | Maximum upload size |
| `ORCHARD_MIN_FILE_SIZE` | 1 byte | Minimum upload size (rejects empty files) |
| `ORCHARD_S3_MAX_RETRIES` | 3 | Retry attempts for S3 operations |
| `ORCHARD_S3_CONNECT_TIMEOUT` | 10s | S3 connection timeout |
| `ORCHARD_S3_READ_TIMEOUT` | 60s | S3 read timeout |

---
## Appendix: Decision Records

### ADR-001: SHA256 for Content Hashing

**Status**: Accepted

**Context**: Need a deterministic content identifier for deduplication.

**Decision**: Use SHA256.

**Rationale**:
- Cryptographically strong (no known attacks)
- Universal adoption (Git, Docker, npm)
- Sufficient speed for I/O-bound workloads
- Excellent tooling

**Consequences**:
- 64-character artifact IDs (longer than UUIDs)
- CPU overhead ~3% of upload time
- Future algorithm migration requires versioning

### ADR-002: Whole-File Deduplication Only

**Status**: Accepted

**Context**: Could implement chunk-level deduplication for better savings.

**Decision**: Whole-file only for MVP.

**Rationale**:
- Simpler implementation
- No chunking algorithm complexity
- Sufficient for the build artifact use case
- Can add chunk-level deduplication later if needed

**Consequences**:
- Files with partial overlap are stored entirely
- Large files with small changes are not deduplicated
- Acceptable for binary artifact workloads

### ADR-003: SQL Triggers for ref_count

**Status**: Accepted

**Context**: ref_count must be accurate for garbage collection.

**Decision**: Use PostgreSQL triggers, not application code.

**Rationale**:
- Atomic with tag operations
- Cannot be bypassed
- Works regardless of client (API, direct SQL, migrations)
- Simpler application code

**Consequences**:
- Trigger logic lives in SQL (less visible)
- Must maintain triggers across schema changes
- Debugging requires database access