Epic: Upstream Artifact Caching for Hermetic Builds

Overview

Orchard will act as a permanent, content-addressable cache for upstream artifacts (npm, PyPI, Maven, Docker, etc.). Once an artifact is cached, it is stored forever under its SHA256 hash, so builds stay reproducible years later even if the upstream source no longer exists.

Problem Statement

Build reproducibility is critical for enterprise environments:

  • Packages get deleted, yanked, or modified upstream
  • Registries go down or change URLs
  • Version constraints resolve differently over time
  • Air-gapped environments cannot access public internet

Teams need to guarantee that a build from 5 years ago produces the exact same output today.

Solution

Orchard becomes "the cache that never forgets":

  1. Fetch once, store forever - When a build needs lodash@4.17.21, Orchard fetches it from npm, stores it by SHA256 hash, and never deletes it
  2. Content-addressable - Same hash = same bytes, guaranteed
  3. Format-agnostic - Orchard doesn't need to understand npm/PyPI/Maven protocols; the client provides the URL, Orchard fetches and stores
  4. Air-gap support - Disable public internet entirely, only allow configured private upstreams
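
To make the "same hash = same bytes" property concrete, here is a minimal sketch of content-addressable storage. It uses a local directory as a stand-in for S3; the names store_blob and load_blob and the two-character directory sharding are illustrative assumptions, not part of Orchard.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(root: Path, data: bytes) -> str:
    """Write data under its SHA256 digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    # Sharding by the first two hex characters is an assumption of this sketch.
    path = root / digest[:2] / digest
    if not path.exists():  # identical content is only written once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest


def load_blob(root: Path, digest: str) -> bytes:
    """Read a blob back and re-verify it against the digest it was stored under."""
    data = (root / digest[:2] / digest).read_bytes()
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError(f"stored content does not match {digest}")
    return data


# Usage: storing the same bytes twice yields the same key.
root = Path(tempfile.mkdtemp())
first = store_blob(root, b"lodash tarball bytes")
second = store_blob(root, b"lodash tarball bytes")
```

Because the key is derived from the content, a later fetch by hash either returns exactly the original bytes or fails verification; it cannot silently return something else.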

User Workflow

1. Build tool resolves dependencies     npm install / pip install / mvn resolve
                ↓
2. Generate lockfile with URLs          package-lock.json / requirements.txt
                ↓
3. Cache all URLs in Orchard            orchard cache --file urls.txt
                ↓
4. Pin by SHA256 hash                   lodash = "sha256:abc123..."
                ↓
5. Future builds fetch by hash          Always get exact same bytes

Key Features

  • Multiple upstream sources - Configure npm, PyPI, Maven Central, private Artifactory, etc.
  • Per-source authentication - Basic auth, bearer tokens, API keys
  • System cache projects - _npm, _pypi, _maven organize cached packages by format
  • Cross-referencing - Link cached artifacts to user projects for visibility
  • URL tracking - Know which URLs map to which hashes, audit provenance
  • Air-gap mode - Global kill switch for all public internet access
  • Environment variable config - 12-factor friendly for containerized deployments

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Orchard Server                          │
├─────────────────────────────────────────────────────────────────┤
│  POST /api/v1/cache                                             │
│    ├── Check if URL already cached (url_hash lookup)            │
│    ├── Match URL to upstream source (get auth)                  │
│    ├── Fetch via UpstreamClient (stream + compute SHA256)       │
│    ├── Store artifact in S3 (content-addressable)               │
│    ├── Create tag in system project (_npm/lodash:4.17.21)       │
│    ├── Optionally create tag in user project                    │
│    └── Record in cached_urls table (provenance)                 │
├─────────────────────────────────────────────────────────────────┤
│  Tables                                                         │
│    ├── upstream_sources (npm-public, pypi-public, artifactory)  │
│    ├── cache_settings (allow_public_internet, etc.)             │
│    ├── cached_urls (url → artifact_id mapping)                  │
│    └── projects.is_system (for _npm, _pypi, etc.)               │
└─────────────────────────────────────────────────────────────────┘

Issues Summary

Issue | Title                                     | Status   | Dependencies
#68   | Schema: Upstream Sources & Cache Tracking | Complete | None
#69   | HTTP Client: Generic URL Fetcher          | Pending  | None
#70   | Cache API Endpoint                        | Pending  | #68, #69
#71   | System Projects (Cache Namespaces)        | Pending  | #68, #70
#72   | Upstream Sources Admin API                | Pending  | #68
#73   | Global Cache Settings API                 | Pending  | #68
#74   | Environment Variable Overrides            | Pending  | #68, #72, #73
#75   | Frontend: Upstream Sources Management     | Pending  | #72, #73
#105  | Frontend: System Projects Integration     | Pending  | #71
#77   | CLI: Cache Command                        | Pending  | #70

Implementation Phases

Phase 1 - Core (MVP):

  • #68 Schema
  • #69 HTTP Client
  • #70 Cache API
  • #71 System Projects

Phase 2 - Admin:

  • #72 Upstream Sources API
  • #73 Cache Settings API
  • #74 Environment Variables

Phase 3 - Frontend:

  • #75 Upstream Sources UI
  • #105 System Projects UI

Phase 4 - CLI:

  • #77 Cache Command

Issue #68: Schema - Upstream Sources & Cache Tracking

Status: Complete

Description

Create database schema for flexible multi-source upstream configuration and URL-to-artifact tracking. This replaces the previous singleton proxy_config design with a more flexible model supporting multiple upstream sources, air-gap mode, and provenance tracking.

Acceptance Criteria

  • upstream_sources table:
    • id (UUID, primary key)
    • name (VARCHAR(255), unique, e.g., "npm-public", "artifactory-private")
    • source_type (VARCHAR(50), enum: npm, pypi, maven, docker, helm, nuget, deb, rpm, generic)
    • url (VARCHAR(2048), base URL of upstream)
    • enabled (BOOLEAN, default false)
    • is_public (BOOLEAN, true if this is a public internet source)
    • auth_type (VARCHAR(20), enum: none, basic, bearer, api_key)
    • username (VARCHAR(255), nullable)
    • password_encrypted (BYTEA, nullable, Fernet encrypted)
    • headers_encrypted (BYTEA, nullable, for custom headers like API keys)
    • priority (INTEGER, default 100, lower = checked first)
    • created_at, updated_at timestamps
  • cache_settings table (singleton, id always 1):
    • id (INTEGER, primary key, check id = 1)
    • allow_public_internet (BOOLEAN, default true, air-gap kill switch)
    • auto_create_system_projects (BOOLEAN, default true)
    • created_at, updated_at timestamps
  • cached_urls table:
    • id (UUID, primary key)
    • url (VARCHAR(4096), original URL fetched)
    • url_hash (VARCHAR(64), SHA256 of URL for fast lookup, indexed)
    • artifact_id (VARCHAR(64), FK to artifacts)
    • source_id (UUID, FK to upstream_sources, nullable for manual imports)
    • fetched_at (TIMESTAMP WITH TIME ZONE)
    • response_headers (JSONB, original upstream headers for provenance)
    • created_at timestamp
  • Add is_system BOOLEAN column to projects table (default false)
  • Migration SQL file in migrations/
  • Runtime migration in database.py
  • SQLAlchemy models for all new tables
  • Pydantic schemas for API input/output (passwords write-only)
  • Encryption helpers for password/headers fields
  • Seed default upstream sources (disabled by default)
  • Unit tests for models and schemas
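
Two of the schema details above can be illustrated in a few lines: the singleton constraint on cache_settings (check id = 1) and the fixed-length url_hash lookup key. The sketch below uses SQLite from the standard library purely so it runs standalone; the target database described here uses types such as JSONB and BYTEA that SQLite does not have, and the reduced column set and the placeholder artifact ID are assumptions of the sketch.

```python
import hashlib
import sqlite3


def url_hash(url: str) -> str:
    """SHA256 hex digest of the URL: always 64 characters, so it fits VARCHAR(64)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cache_settings (
        id INTEGER PRIMARY KEY CHECK (id = 1),  -- singleton row
        allow_public_internet INTEGER NOT NULL DEFAULT 1,
        auto_create_system_projects INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE cached_urls (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        url_hash TEXT NOT NULL,
        artifact_id TEXT NOT NULL
    );
    CREATE INDEX idx_cached_urls_url_hash ON cached_urls (url_hash);
    """
)

# The first settings row is accepted; a second one violates the CHECK constraint.
conn.execute("INSERT INTO cache_settings (id) VALUES (1)")
try:
    conn.execute("INSERT INTO cache_settings (id) VALUES (2)")
    second_row_rejected = False
except sqlite3.IntegrityError:
    second_row_rejected = True

# Look up a cached URL by its hash rather than by the (up to 4096 character) URL.
url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz"
conn.execute(
    "INSERT INTO cached_urls (url, url_hash, artifact_id) VALUES (?, ?, ?)",
    (url, url_hash(url), "placeholder-artifact-id"),
)
row = conn.execute(
    "SELECT artifact_id FROM cached_urls WHERE url_hash = ?", (url_hash(url),)
).fetchone()
```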

Files Modified

  • migrations/010_upstream_caching.sql
  • backend/app/database.py (migrations 016-020)
  • backend/app/models.py (UpstreamSource, CacheSettings, CachedUrl, Project.is_system)
  • backend/app/schemas.py (all caching schemas)
  • backend/app/encryption.py (renamed env var)
  • backend/app/config.py (renamed setting)
  • backend/tests/test_upstream_caching.py (37 tests)
  • frontend/src/components/Layout.tsx (footer tagline)
  • CHANGELOG.md

Issue #69: HTTP Client - Generic URL Fetcher

Status: Pending

Description

Create a reusable HTTP client for fetching artifacts from upstream sources. Supports multiple auth methods, streaming for large files, and computes SHA256 while downloading.

Acceptance Criteria

  • UpstreamClient class in backend/app/upstream.py
  • fetch(url) method that:
    • Streams response body (doesn't load large files into memory)
    • Computes SHA256 hash while streaming
    • Returns file content, hash, size, and response headers
  • Auth support based on upstream source configuration:
    • None (anonymous)
    • Basic auth (username/password)
    • Bearer token (Authorization: Bearer {token})
    • API key (custom header name/value)
  • URL-to-source matching:
    • Match URL to configured upstream source by URL prefix
    • Apply auth from matched source
    • Respect source priority for multiple matches
  • Configuration options:
    • Timeout (connect and read, default 30s/300s)
    • Max retries (default 3)
    • Follow redirects (default true, max 5)
    • Max file size (reject if Content-Length exceeds limit)
  • Respect allow_public_internet setting:
    • If false, reject URLs matching is_public=true sources
    • If false, reject URLs not matching any configured source
  • Capture response headers for provenance tracking
  • Proper error handling:
    • Connection errors (retry with backoff)
    • HTTP errors (4xx, 5xx)
    • Timeout errors
    • SSL/TLS errors
  • Logging for debugging (URL, source matched, status, timing)
  • Unit tests with mocked HTTP responses
  • Integration tests against httpbin.org or similar (optional, marked)
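
The URL-to-source matching and the allow_public_internet rules above can be sketched as two small functions. This is one possible reading of the criteria, not the planned implementation: the Source dataclass, the function names, and the handling of unmatched URLs when public internet is allowed are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical, reduced view of an upstream_sources row."""
    name: str
    url: str          # base URL used as the match prefix
    is_public: bool
    priority: int = 100  # lower = checked first


def match_source(url: str, sources: Iterable[Source]) -> Optional[Source]:
    """Return the prefix-matching source with the lowest priority value, if any."""
    # Plain prefix matching as described above; a real implementation may want
    # to compare on a host boundary so "https://a.example" cannot match
    # "https://a.example.evil.test/".
    candidates = [s for s in sources if url.startswith(s.url)]
    return min(candidates, key=lambda s: s.priority, default=None)


def is_fetch_allowed(url: str, sources: Iterable[Source], allow_public_internet: bool) -> bool:
    """Apply the air-gap rules: when public internet is off, only private sources pass."""
    if allow_public_internet:
        return True  # assumption: unmatched URLs are fetched anonymously
    source = match_source(url, list(sources))
    return source is not None and not source.is_public


sources = [
    Source("npm-public", "https://registry.npmjs.org/", is_public=True),
    Source("artifactory", "https://artifactory.corp.example/", is_public=False),
    Source("artifactory-npm", "https://artifactory.corp.example/npm/", is_public=False, priority=10),
]
npm_url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz"
private_url = "https://artifactory.corp.example/npm/lodash-4.17.21.tgz"
```

Here private_url matches two sources, and the one with priority 10 wins over the default of 100.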

Technical Notes

  • Use httpx for async HTTP support (already in requirements)
  • Stream to temp file to avoid memory issues with large artifacts
  • Consider checksum verification if upstream provides it (e.g., npm provides shasum)
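
As a rough illustration of "stream to temp file and hash while downloading", the sketch below consumes any iterable of byte chunks. In the real client the chunks would presumably come from an httpx streaming response; here a plain list stands in so the example runs without network access. The function name, the FileTooLarge exception, and the size check during streaming (useful when Content-Length is missing or wrong) are assumptions.

```python
import hashlib
import tempfile
from typing import IO, Iterable, Optional, Tuple


class FileTooLarge(Exception):
    """Raised when the streamed body exceeds the configured maximum size."""


def stream_to_tempfile(
    chunks: Iterable[bytes], max_size: Optional[int] = None
) -> Tuple[IO[bytes], str, int]:
    """Write chunks to a temp file while hashing; return (file, sha256 hex, size)."""
    hasher = hashlib.sha256()
    size = 0
    tmp = tempfile.TemporaryFile()
    try:
        for chunk in chunks:
            size += len(chunk)
            if max_size is not None and size > max_size:
                raise FileTooLarge(f"body exceeds {max_size} bytes")
            hasher.update(chunk)  # hash incrementally, never holding the whole body
            tmp.write(chunk)
    except BaseException:
        tmp.close()  # do not leak the temp file on failure
        raise
    tmp.seek(0)  # rewind so the caller can upload the content
    return tmp, hasher.hexdigest(), size


# Usage with an in-memory stand-in for a streamed response body.
body, digest, size = stream_to_tempfile([b"hel", b"lo"])
content = body.read()
body.close()
```

Memory use stays bounded by the chunk size regardless of artifact size, which is the point of streaming to disk rather than buffering.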

Issue #70: Cache API Endpoint

Status: Pending

Description

API endpoint to cache an artifact from an upstream URL. This is the core endpoint that fetches from upstream, stores in Orchard, and creates appropriate tags.

Acceptance Criteria

  • POST /api/v1/cache endpoint
  • Request body:
    {
      "url": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
      "source_type": "npm",
      "package_name": "lodash",
      "tag": "4.17.21",
      "user_project": "my-app",
      "user_package": "npm-deps",
      "user_tag": "lodash-4.17.21",
      "expected_hash": "sha256:abc123..."
    }
    
    • url (required): URL to fetch
    • source_type (required): Determines system project (_npm, _pypi, etc.)
    • package_name (optional): Package name in system project, derived from URL if not provided
    • tag (optional): Tag name in system project, derived from URL if not provided
    • user_project, user_package, user_tag (optional): Cross-reference in user's project
    • expected_hash (optional): Verify downloaded content matches
  • Response:
    {
      "artifact_id": "abc123...",
      "sha256": "abc123...",
      "size": 12345,
      "content_type": "application/gzip",
      "already_cached": false,
      "source_url": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
      "source_name": "npm-public",
      "system_project": "_npm",
      "system_package": "lodash",
      "system_tag": "4.17.21",
      "user_reference": "my-app/npm-deps:lodash-4.17.21"
    }
    
  • Behavior:
    • Check if URL already cached (by url_hash in cached_urls)
    • If cached: return existing artifact, optionally create user tag
    • If not cached: fetch via UpstreamClient, store artifact, create tags
    • Create/get system project if needed (e.g., _npm)
    • Create package in system project (e.g., _npm/lodash)
    • Create tag in system project (e.g., _npm/lodash:4.17.21)
    • If user reference provided, create tag in user's project
    • Record in cached_urls table with provenance
  • Error handling:
    • 400: Invalid request (bad URL format, missing required fields)
    • 403: Air-gap mode enabled and URL is from public source
    • 404: Upstream returned 404
    • 409: Hash mismatch (if expected_hash provided)
    • 502: Upstream fetch failed (connection error, timeout)
    • 503: Upstream source disabled
  • Authentication required (any authenticated user can cache)
  • Audit logging for cache operations
  • Integration tests covering success and error cases
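
The behavior listed above (lookup by url_hash, fetch on miss, tag creation, optional user reference, hash verification) can be sketched with in-memory dictionaries. Everything here is illustrative: CacheStore, cache_url, and the fetch callable are hypothetical names, the dictionaries stand in for S3 and the database, and checking expected_hash for already-cached URLs as well is an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


class HashMismatch(Exception):
    """Content does not match expected_hash (409 in the error list above)."""


@dataclass
class CacheStore:
    artifacts: Dict[str, bytes] = field(default_factory=dict)  # sha256 -> content
    cached_urls: Dict[str, str] = field(default_factory=dict)  # url_hash -> sha256
    tags: Dict[str, str] = field(default_factory=dict)         # "project/package:tag" -> sha256


def cache_url(
    store: CacheStore,
    url: str,
    system_ref: str,
    fetch: Callable[[str], bytes],
    expected_hash: Optional[str] = None,
    user_ref: Optional[str] = None,
) -> dict:
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    sha256 = store.cached_urls.get(key)
    already_cached = sha256 is not None
    content = b""
    if not already_cached:
        content = fetch(url)
        sha256 = hashlib.sha256(content).hexdigest()
    if expected_hash is not None and expected_hash != f"sha256:{sha256}":
        raise HashMismatch(f"expected {expected_hash}, got sha256:{sha256}")
    if not already_cached:
        store.artifacts.setdefault(sha256, content)  # same bytes are stored once
        store.cached_urls[key] = sha256
        store.tags[system_ref] = sha256
    if user_ref is not None:
        store.tags[user_ref] = sha256
    return {"artifact_id": sha256, "sha256": sha256, "already_cached": already_cached}


fetched = []

def fake_fetch(url: str) -> bytes:
    fetched.append(url)
    return b"tarball-bytes"

store = CacheStore()
url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz"
first = cache_url(store, url, "_npm/lodash:4.17.21", fake_fetch)
second = cache_url(store, url, "_npm/lodash:4.17.21", fake_fetch,
                   user_ref="my-app/npm-deps:lodash-4.17.21")
try:
    cache_url(store, "https://example.com/file.tar.gz", "_generic/file:1", fake_fetch,
              expected_hash="sha256:" + "0" * 64)
    mismatch_rejected = False
except HashMismatch:
    mismatch_rejected = True
```

Note that the mismatching download is rejected before anything is recorded, so a bad expected_hash never leaves a partial entry behind.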

Technical Notes

  • URL parsing for package_name/tag derivation is format-specific:
    • npm: /{package}/-/{package}-{version}.tgz → package=lodash, tag=4.17.21
    • pypi: /packages/.../requests-2.28.0.tar.gz → package=requests, tag=2.28.0
    • maven: /{group}/{artifact}/{version}/{artifact}-{version}.jar
  • Deduplication: if same SHA256 already exists, just create new tag pointing to it
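
A possible shape for the format-specific derivation, covering npm tarballs (including scoped packages) and PyPI source distributions only. The function names and regular expressions are assumptions of this sketch; wheels, Maven coordinates, and other formats would need their own rules.

```python
import re
from typing import Optional, Tuple
from urllib.parse import unquote, urlparse

# /{package}/-/{file}.tgz, where package may be scoped, e.g. @types/node
_NPM_PATH = re.compile(r"^/(?P<package>(?:@[^/]+/)?[^/]+)/-/(?P<file>[^/]+)\.tgz$")
# {package}-{version}.tar.gz or .zip (sdist file names only)
_SDIST_FILE = re.compile(r"^(?P<package>.+)-(?P<version>[^-]+)\.(?:tar\.gz|zip)$")


def parse_npm(url: str) -> Optional[Tuple[str, str]]:
    """Return (package, version) for an npm tarball URL, or None if it does not match."""
    match = _NPM_PATH.match(unquote(urlparse(url).path))
    if match is None:
        return None
    package = match.group("package")
    # The file name uses the unscoped name: "@types/node" -> "node-18.0.0"
    prefix = package.rsplit("/", 1)[-1] + "-"
    filename = match.group("file")
    if not filename.startswith(prefix):
        return None
    return package, filename[len(prefix):]


def parse_pypi_sdist(url: str) -> Optional[Tuple[str, str]]:
    """Return (package, version) from the file name of a PyPI sdist URL, or None."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    match = _SDIST_FILE.match(filename)
    if match is None:
        return None
    return match.group("package"), match.group("version")


lodash = parse_npm("https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz")
scoped = parse_npm("https://registry.npmjs.org/@types/node/-/node-18.0.0.tgz")
```

Returning None on a non-matching URL lets the caller fall back to explicitly supplied package_name and tag values.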

Issue #71: System Projects (Cache Namespaces)

Status: Pending

Description

Implement auto-created system projects for organizing cached artifacts by format type. These are special projects that provide a browsable namespace for all cached upstream packages.

Acceptance Criteria

  • System project names: _npm, _pypi, _maven, _docker, _helm, _nuget, _deb, _rpm, _generic
  • Auto-creation:
    • Created automatically on first cache request for that format
    • Created by cache endpoint, not at startup
    • Uses system user as creator (created_by = "system")
  • System project properties:
    • is_system = true
    • is_public = true (readable by all authenticated users)
    • description = "System cache for {format} packages"
  • Restrictions:
    • Cannot be deleted (return 403 with message)
    • Cannot be renamed
    • Cannot change is_public to false
    • Only admins can modify description
  • Helper function: get_or_create_system_project(source_type) in routes.py or new cache.py module
  • Update project deletion endpoint to check is_system flag
  • Update project update endpoint to enforce restrictions
  • Query helper: list all system projects for UI dropdown
  • Unit tests for restrictions
  • Integration tests for auto-creation and restrictions

Technical Notes

  • System projects are identified by is_system=true, not just naming convention
  • The _ prefix is a convention for display purposes
  • Packages within system projects follow upstream naming (e.g., _npm/lodash, _npm/@types/node)
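
A minimal in-memory sketch of get_or_create_system_project and the deletion restriction follows. The Project dataclass, the dictionary standing in for the projects table, the auto_create parameter, and the exception types are assumptions; only the helper name, the project properties, and the format list come from the criteria above.

```python
from dataclasses import dataclass
from typing import Dict

SYSTEM_FORMATS = ("npm", "pypi", "maven", "docker", "helm", "nuget", "deb", "rpm", "generic")


class SystemProjectError(Exception):
    """Operation not permitted on a system project (403 in the API)."""


@dataclass
class Project:
    name: str
    description: str = ""
    is_public: bool = False
    is_system: bool = False
    created_by: str = ""


def get_or_create_system_project(
    projects: Dict[str, Project], source_type: str, auto_create: bool = True
) -> Project:
    if source_type not in SYSTEM_FORMATS:
        raise ValueError(f"unknown source type: {source_type}")
    name = f"_{source_type}"
    project = projects.get(name)
    if project is None:
        if not auto_create:
            raise LookupError(f"system project {name} has not been created")
        project = Project(
            name=name,
            description=f"System cache for {source_type} packages",
            is_public=True,
            is_system=True,
            created_by="system",
        )
        projects[name] = project
    return project


def delete_project(projects: Dict[str, Project], name: str) -> None:
    # The restriction keys off is_system, not the "_" naming convention.
    if projects[name].is_system:
        raise SystemProjectError(f"system project {name} cannot be deleted")
    del projects[name]


projects: Dict[str, Project] = {}
npm_project = get_or_create_system_project(projects, "npm")
same_project = get_or_create_system_project(projects, "npm")
try:
    delete_project(projects, "_npm")
    delete_blocked = False
except SystemProjectError:
    delete_blocked = True
```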

Issue #72: Upstream Sources Admin API

Status: Pending

Description

CRUD API endpoints for managing upstream sources configuration. Admin-only access.

Acceptance Criteria

  • GET /api/v1/admin/upstream-sources - List all upstream sources
    • Returns array of sources with id, name, source_type, url, enabled, is_public, auth_type, priority, has_credentials, created_at, updated_at
    • Supports ?enabled=true/false filter
    • Supports ?source_type=npm,pypi filter
    • Passwords/tokens never returned
  • POST /api/v1/admin/upstream-sources - Create upstream source
    • Request: name, source_type, url, enabled, is_public, auth_type, username, password, headers, priority
    • Validates unique name
    • Validates URL format
    • Encrypts password/headers before storage
    • Returns created source (without secrets)
  • GET /api/v1/admin/upstream-sources/{id} - Get source details
    • Returns source with has_credentials boolean, not actual credentials
  • PUT /api/v1/admin/upstream-sources/{id} - Update source
    • Partial update supported
    • If password provided, re-encrypt; if omitted, keep existing
    • Special value password: null clears credentials
  • DELETE /api/v1/admin/upstream-sources/{id} - Delete source
    • Returns 400 if source has cached_urls referencing it (optional: cascade or reassign)
  • POST /api/v1/admin/upstream-sources/{id}/test - Test connectivity
    • Attempts HEAD request to source URL
    • Returns success/failure with status code and timing
    • Does not cache anything
  • All endpoints require admin role
  • Audit logging for all mutations
  • Pydantic schemas: UpstreamSourceCreate, UpstreamSourceUpdate, UpstreamSourceResponse
  • Integration tests for all endpoints

Technical Notes

  • Test endpoint should respect auth configuration to verify credentials work
  • Consider adding last_used_at and last_error fields for observability (future enhancement)
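
The update semantics for credentials (omitted keeps, null clears, a value re-encrypts) and the has_credentials response field are easy to get subtly wrong, so here is a small sketch. It works on plain dictionaries; with Pydantic, dumping the update model with unset fields excluded gives a comparable "was this field sent?" distinction. The function names are hypothetical, and the encrypt callable is a stand-in for the Fernet helper, not real encryption.

```python
from typing import Any, Callable, Dict, Optional

SECRET_FIELDS = ("password_encrypted", "headers_encrypted")


def apply_password_update(
    stored: Optional[bytes],
    payload: Dict[str, Any],
    encrypt: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the new password_encrypted value for a partial update payload."""
    if "password" not in payload:
        return stored            # field omitted: keep existing credentials
    if payload["password"] is None:
        return None              # explicit null: clear credentials
    return encrypt(payload["password"])  # new value: re-encrypt


def to_response(source: Dict[str, Any]) -> Dict[str, Any]:
    """Strip secrets and expose only whether credentials are present."""
    body = {k: v for k, v in source.items() if k not in SECRET_FIELDS}
    body["has_credentials"] = any(source.get(f) for f in SECRET_FIELDS)
    return body


def fake_encrypt(value: str) -> bytes:
    # Placeholder only; the real helper would use Fernet.
    return b"enc:" + value.encode("utf-8")


kept = apply_password_update(b"old", {"enabled": False}, fake_encrypt)
cleared = apply_password_update(b"old", {"password": None}, fake_encrypt)
replaced = apply_password_update(b"old", {"password": "s3cret"}, fake_encrypt)
response = to_response(
    {"name": "npm-private", "password_encrypted": b"old", "headers_encrypted": None}
)
```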

Issue #73: Global Cache Settings API

Status: Pending

Description

API endpoints for managing global cache settings including air-gap mode.

Acceptance Criteria

  • GET /api/v1/admin/cache-settings - Get current settings
    • Returns: allow_public_internet, auto_create_system_projects, created_at, updated_at
  • PUT /api/v1/admin/cache-settings - Update settings
    • Partial update supported
    • Returns updated settings
  • Settings fields:
    • allow_public_internet (boolean): When false, blocks all requests to sources marked is_public=true and to URLs that match no configured source
    • auto_create_system_projects (boolean): When false, system projects must be created manually
  • Admin-only access
  • Audit logging for changes (especially air-gap mode changes)
  • Pydantic schemas: CacheSettingsResponse, CacheSettingsUpdate
  • Initialize singleton row on first access if not exists
  • Integration tests

Technical Notes

  • Air-gap mode change should be logged prominently (security-relevant)
  • Consider requiring confirmation header for disabling air-gap mode (similar to factory reset)
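
One way to combine partial updates with prominent logging of air-gap changes is sketched below. The CacheSettings dataclass mirrors the two settings fields; the logger name, the actor parameter, and the use of None to mean "field not supplied" are assumptions of the sketch.

```python
import logging
from dataclasses import dataclass, replace
from typing import Optional

logger = logging.getLogger("orchard.cache_settings")  # hypothetical logger name


@dataclass(frozen=True)
class CacheSettings:
    allow_public_internet: bool = True
    auto_create_system_projects: bool = True


def update_settings(
    current: CacheSettings,
    *,
    allow_public_internet: Optional[bool] = None,
    auto_create_system_projects: Optional[bool] = None,
    actor: str = "unknown",
) -> CacheSettings:
    """Apply only the supplied fields and log air-gap transitions at WARNING level."""
    changes = {}
    if allow_public_internet is not None:
        changes["allow_public_internet"] = allow_public_internet
    if auto_create_system_projects is not None:
        changes["auto_create_system_projects"] = auto_create_system_projects
    updated = replace(current, **changes)
    if updated.allow_public_internet != current.allow_public_internet:
        # Air-gap mode is on exactly when public internet access is off.
        state = "disabled" if updated.allow_public_internet else "enabled"
        logger.warning("air-gap mode %s by %s", state, actor)
    return updated


defaults = CacheSettings()
air_gapped = update_settings(defaults, allow_public_internet=False, actor="admin")
unchanged = update_settings(air_gapped)
```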

Issue #74: Environment Variable Overrides

Status: Pending

Description

Allow cache and upstream configuration via environment variables for containerized deployments. Environment variables override database settings following 12-factor app principles.

Acceptance Criteria

  • Global settings overrides:
    • ORCHARD_CACHE_ALLOW_PUBLIC_INTERNET=true/false
    • ORCHARD_CACHE_AUTO_CREATE_SYSTEM_PROJECTS=true/false
    • ORCHARD_CACHE_ENCRYPTION_KEY (Fernet key for credential encryption)
  • Upstream source definition via env vars:
    • ORCHARD_UPSTREAM__{NAME}__URL (double underscore as separator)
    • ORCHARD_UPSTREAM__{NAME}__TYPE (npm, pypi, maven, etc.)
    • ORCHARD_UPSTREAM__{NAME}__ENABLED (true/false)
    • ORCHARD_UPSTREAM__{NAME}__IS_PUBLIC (true/false)
    • ORCHARD_UPSTREAM__{NAME}__AUTH_TYPE (none, basic, bearer, api_key)
    • ORCHARD_UPSTREAM__{NAME}__USERNAME
    • ORCHARD_UPSTREAM__{NAME}__PASSWORD
    • ORCHARD_UPSTREAM__{NAME}__PRIORITY
    • Example: ORCHARD_UPSTREAM__NPM_PRIVATE__URL=https://npm.corp.com
  • Env var sources:
    • Loaded at startup
    • Merged with database sources
    • Env var sources have source = "env" marker
    • Cannot be modified via API (return 400)
    • Cannot be deleted via API (return 400)
  • Update Settings class in config.py
  • Update get/list endpoints to include env-defined sources
  • Document all env vars in CLAUDE.md
  • Unit tests for env var parsing
  • Integration tests with env vars set

Technical Notes

  • Double underscore (__) separator allows source names with single underscores
  • Env-defined sources should appear in API responses but marked as read-only
  • Consider startup validation that warns about invalid env var combinations
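
The double-underscore convention can be parsed by splitting on the first "__" after the prefix, which leaves single underscores in both the source name and the field name intact. The sketch below is one way to do it; the function name, the lowercasing of the source name, and the silent skipping of unknown fields are assumptions.

```python
from typing import Any, Dict, Mapping

PREFIX = "ORCHARD_UPSTREAM__"
# Env var field suffix -> attribute name on the upstream source
FIELDS = {
    "URL": "url",
    "TYPE": "source_type",
    "ENABLED": "enabled",
    "IS_PUBLIC": "is_public",
    "AUTH_TYPE": "auth_type",
    "USERNAME": "username",
    "PASSWORD": "password",
    "PRIORITY": "priority",
}
BOOL_FIELDS = {"ENABLED", "IS_PUBLIC"}


def parse_upstream_env(environ: Mapping[str, str]) -> Dict[str, Dict[str, Any]]:
    """Group ORCHARD_UPSTREAM__{NAME}__{FIELD} variables into one dict per source."""
    sources: Dict[str, Dict[str, Any]] = {}
    for key, value in environ.items():
        if not key.startswith(PREFIX):
            continue
        name, sep, field = key[len(PREFIX):].partition("__")
        if not sep or not name or field not in FIELDS:
            continue  # assumption: ignore malformed or unknown variables
        entry = sources.setdefault(name.lower(), {"source": "env"})
        if field in BOOL_FIELDS:
            entry[FIELDS[field]] = value.strip().lower() == "true"
        elif field == "PRIORITY":
            entry[FIELDS[field]] = int(value)
        else:
            entry[FIELDS[field]] = value
    return sources


parsed = parse_upstream_env(
    {
        "ORCHARD_UPSTREAM__NPM_PRIVATE__URL": "https://npm.corp.com",
        "ORCHARD_UPSTREAM__NPM_PRIVATE__IS_PUBLIC": "false",
        "ORCHARD_UPSTREAM__NPM_PRIVATE__PRIORITY": "10",
        "ORCHARD_CACHE_ALLOW_PUBLIC_INTERNET": "false",
        "PATH": "/usr/bin",
    }
)
```

In practice the function would be called with os.environ at startup; passing a mapping keeps it easy to unit test.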

Issue #75: Frontend - Upstream Sources Management

Status: Pending

Description

Admin UI for managing upstream sources and cache settings.

Acceptance Criteria

  • New admin page: /admin/cache or /admin/upstream-sources
  • Upstream sources section:
    • Table listing all sources with: name, type, URL, enabled toggle, public badge, priority, actions
    • Visual distinction for env-defined sources (locked icon, no edit/delete)
    • Create button opens modal/form
    • Edit button for DB-defined sources
    • Delete with confirmation modal
    • Test connection button with status indicator
  • Create/edit form fields:
    • Name (text, required)
    • Source type (dropdown)
    • URL (text, required)
    • Priority (number)
    • Is public (checkbox)
    • Enabled (checkbox)
    • Auth type (dropdown: none, basic, bearer, api_key)
    • Conditional auth fields based on type:
      • Basic: username, password
      • Bearer: token
      • API key: header name, header value
    • Password fields masked, "unchanged" placeholder on edit
  • Cache settings section:
    • Air-gap mode toggle with warning
    • Auto-create system projects toggle
    • Prominent warning banner shown while air-gap mode is enabled
  • Link from main admin navigation
  • Loading and error states
  • Success/error toast notifications

Technical Notes

  • Use existing admin page patterns from user management
  • Air-gap toggle should require confirmation (modal with warning text)

Issue #105: Frontend - System Projects Integration

Status: Pending

Description

Integrate system projects into the frontend UI with appropriate visual treatment and navigation.

Acceptance Criteria

  • Home page project dropdown:
    • System projects shown in separate "Cached Packages" section
    • Visual distinction (icon, different background, or badge)
    • Format icon for each type (npm, pypi, maven, etc.)
  • Project list/grid:
    • System projects can be filtered: "Show system projects" toggle
    • Or separate tab: "Projects" | "Package Cache"
  • System project page:
    • "System Cache" badge in header
    • Description explains this is auto-managed cache
    • Settings/delete buttons hidden or disabled
    • Shows format type prominently
  • Package page within system project:
    • Shows "Cached from" with source URL (linked)
    • Shows "First cached" timestamp
    • Shows which upstream source provided it
  • Artifact page:
    • If artifact came from cache, show provenance:
      • Original URL
      • Upstream source name
      • Fetch timestamp
  • Search includes system projects (with filter option)

Technical Notes

  • Use React context or query params for system project filtering
  • Consider dedicated route: /cache/npm/lodash as alias for /_npm/lodash

Issue #77: CLI - Cache Command

Status: Pending

Description

Add a new orchard cache command to the existing CLI for caching artifacts from upstream URLs. This integrates with the new cache API endpoint and can optionally update orchard.ensure with cached artifacts.

Acceptance Criteria

  • New command: orchard cache <url> in orchard/commands/cache.py
  • Basic usage:
    # Cache a URL, print artifact info
    orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz
    
    # Output:
    # Caching https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz...
    #   Source type: npm
    #   Package: lodash
    #   Version: 4.17.21
    #
    # Successfully cached artifact
    #   Artifact ID: abc123...
    #   Size: 1.2 MB
    #   System project: _npm
    #   System package: lodash
    #   System tag: 4.17.21
    
  • Options:
    Option               | Description
    --type, -t TYPE      | Source type: npm, pypi, maven, docker, helm, generic (auto-detected from URL if not provided)
    --package, -p NAME   | Package name in system project (auto-derived from URL if not provided)
    --tag TAG            | Tag name in system project (auto-derived from URL if not provided)
    --project PROJECT    | Also create tag in this user project
    --user-package PKG   | Package name in user project (required if --project specified)
    --user-tag TAG       | Tag name in user project (default: same as system tag)
    --expected-hash HASH | Verify downloaded content matches this SHA256
    --add                | Add to orchard.ensure after caching
    --add-path PATH      | Extraction path for --add (default: <package>/)
    --file, -f FILE      | Path to orchard.ensure file
    --verbose, -v        | Show detailed output
  • URL type auto-detection:
    • registry.npmjs.org → npm
    • pypi.org or files.pythonhosted.org → pypi
    • repo1.maven.org or contains /maven2/ → maven
    • registry-1.docker.io or docker.io → docker
    • Otherwise → generic
  • Package/version extraction from URL patterns:
    • npm: /{package}/-/{package}-{version}.tgz
    • pypi: /packages/.../{package}-{version}.tar.gz
    • maven: /{group}/{artifact}/{version}/{artifact}-{version}.jar
  • Add cache_artifact() function to orchard/api.py
  • Integration with --add flag:
    • Parse existing orchard.ensure
    • Add new dependency entry pointing to cached artifact
    • Use artifact_id (SHA256) for hermetic pinning
  • Batch mode: orchard cache --file urls.txt
    • One URL per line
    • Lines starting with # are comments
    • Report success/failure for each
  • Exit codes:
    • 0: Success (or already cached)
    • 1: Fetch failed
    • 2: Hash mismatch
    • 3: Air-gap mode blocked request
  • Error handling consistent with existing CLI patterns
  • Unit tests in test/test_cache.py
  • Update README.md with cache command documentation

Technical Notes

  • Follow existing Click patterns from other commands
  • Use get_auth_headers() from orchard/auth.py
  • URL parsing can use urllib.parse
  • Consider adding URL pattern registry for extensibility
  • The --add flag should integrate with existing ensure file parsing in orchard/ensure.py
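
The URL type auto-detection rules and the batch file format from the acceptance criteria are simple enough to sketch with urllib.parse alone. The function names are hypothetical, and skipping blank lines in the batch file is an assumption (the criteria only mention # comments).

```python
from typing import Iterable, List
from urllib.parse import urlparse


def detect_source_type(url: str) -> str:
    """Map a URL to a source type using the host and path rules listed above."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "registry.npmjs.org":
        return "npm"
    if host in ("pypi.org", "files.pythonhosted.org"):
        return "pypi"
    if host == "repo1.maven.org" or "/maven2/" in parsed.path:
        return "maven"
    if host in ("registry-1.docker.io", "docker.io"):
        return "docker"
    return "generic"


def read_url_file(lines: Iterable[str]) -> List[str]:
    """Return the URLs from a batch file: one per line, '#' lines are comments."""
    urls = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        urls.append(stripped)
    return urls


batch = read_url_file(
    [
        "# npm dependencies\n",
        "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz\n",
        "\n",
        "https://internal.corp/files/custom-lib.tar.gz\n",
    ]
)
detected = [detect_source_type(u) for u in batch]
```

Matching on the parsed hostname rather than a substring of the whole URL avoids misclassifying a URL that merely mentions a registry name in its path or query string.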

Example Workflows

# Simple: cache a single URL
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz

# Cache and add to orchard.ensure for current project
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz \
  --add --add-path libs/lodash/

# Cache with explicit metadata
orchard cache https://internal.corp/files/custom-lib.tar.gz \
  --type generic \
  --package custom-lib \
  --tag v1.0.0

# Cache and cross-reference to user project
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz \
  --project my-app \
  --user-package npm-deps \
  --user-tag lodash-4.17.21

# Batch cache from file
orchard cache --file deps-urls.txt

# Verify hash while caching
orchard cache https://example.com/file.tar.gz \
  --expected-hash sha256:abc123...

Out of Scope (Future Enhancements)

  • Automatic transitive dependency resolution (client's responsibility)
  • Lockfile parsing (package-lock.json, requirements.txt) - stretch goal for CLI
  • Cache eviction policies (we cache forever by design)
  • Mirroring/sync between Orchard instances
  • Format-specific metadata extraction (npm package.json parsing, etc.)

Success Criteria

  • Can cache any URL and retrieve by SHA256 hash
  • Cached artifacts persist indefinitely
  • Air-gap mode blocks all public internet access
  • Multiple upstream sources with different auth
  • System projects organize cached packages by format
  • CLI can cache URLs and update orchard.ensure
  • Admin UI for upstream source management