# Epic: Upstream Artifact Caching for Hermetic Builds

## Overview

Orchard will act as a permanent, content-addressable cache for upstream artifacts (npm, PyPI, Maven, Docker, etc.). Once an artifact is cached, it is stored forever by SHA256 hash - enabling reproducible builds years later regardless of whether the upstream source still exists.

## Problem Statement

Build reproducibility is critical for enterprise environments:

- Packages get deleted, yanked, or modified upstream
- Registries go down or change URLs
- Version constraints resolve differently over time
- Air-gapped environments cannot access public internet

Teams need to guarantee that a build from 5 years ago produces the exact same output today.

## Solution

Orchard becomes "the cache that never forgets":

1. **Fetch once, store forever** - When a build needs `lodash@4.17.21`, Orchard fetches it from npm, stores it by SHA256 hash, and never deletes it
2. **Content-addressable** - Same hash = same bytes, guaranteed
3. **Format-agnostic** - Orchard doesn't need to understand npm/PyPI/Maven protocols; the client provides the URL, Orchard fetches and stores
4. **Air-gap support** - Disable public internet entirely, only allow configured private upstreams

## User Workflow

```
1. Build tool resolves dependencies
   npm install / pip install / mvn resolve
        ↓
2. Generate lockfile with URLs
   package-lock.json / requirements.txt
        ↓
3. Cache all URLs in Orchard
   orchard cache --file urls.txt
        ↓
4. Pin by SHA256 hash
   lodash = "sha256:abc123..."
        ↓
5. Future builds fetch by hash
   Always get exact same bytes
```

## Key Features

- **Multiple upstream sources** - Configure npm, PyPI, Maven Central, private Artifactory, etc.
- **Per-source authentication** - Basic auth, bearer tokens, API keys
- **System cache projects** - `_npm`, `_pypi`, `_maven` organize cached packages by format
- **Cross-referencing** - Link cached artifacts to user projects for visibility
- **URL tracking** - Know which URLs map to which hashes, audit provenance
- **Air-gap mode** - Global kill switch for all public internet access
- **Environment variable config** - 12-factor friendly for containerized deployments

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Orchard Server                          │
├─────────────────────────────────────────────────────────────────┤
│  POST /api/v1/cache                                             │
│    ├── Check if URL already cached (url_hash lookup)            │
│    ├── Match URL to upstream source (get auth)                  │
│    ├── Fetch via UpstreamClient (stream + compute SHA256)       │
│    ├── Store artifact in S3 (content-addressable)               │
│    ├── Create tag in system project (_npm/lodash:4.17.21)       │
│    ├── Optionally create tag in user project                    │
│    └── Record in cached_urls table (provenance)                 │
├─────────────────────────────────────────────────────────────────┤
│  Tables                                                         │
│    ├── upstream_sources (npm-public, pypi-public, artifactory)  │
│    ├── cache_settings (allow_public_internet, etc.)             │
│    ├── cached_urls (url → artifact_id mapping)                  │
│    └── projects.is_system (for _npm, _pypi, etc.)               │
└─────────────────────────────────────────────────────────────────┘
```

## Issues Summary

| Issue | Title | Status | Dependencies |
|-------|-------|--------|--------------|
| #68 | Schema: Upstream Sources & Cache Tracking | ✅ Complete | None |
| #69 | HTTP Client: Generic URL Fetcher | Pending | None |
| #70 | Cache API Endpoint | Pending | #68, #69 |
| #71 | System Projects (Cache Namespaces) | Pending | #68, #70 |
| #72 | Upstream Sources Admin API | Pending | #68 |
| #73 | Global Cache Settings API | Pending | #68 |
| #74 | Environment Variable Overrides | Pending | #68, #72, #73 |
| #75 | Frontend: Upstream Sources Management | Pending | #72, #73 |
| #105 | Frontend: System Projects Integration | Pending | #71 |
| #77 | CLI: Cache Command | Pending | #70 |

## Implementation Phases

**Phase 1 - Core (MVP):**
- #68 Schema ✅
- #69 HTTP Client
- #70 Cache API
- #71 System Projects

**Phase 2 - Admin:**
- #72 Upstream Sources API
- #73 Cache Settings API
- #74 Environment Variables

**Phase 3 - Frontend:**
- #75 Upstream Sources UI
- #105 System Projects UI

**Phase 4 - CLI:**
- #77 Cache Command

---

# Issue #68: Schema - Upstream Sources & Cache Tracking

**Status: ✅ Complete**

## Description

Create database schema for flexible multi-source upstream configuration and URL-to-artifact tracking. This replaces the previous singleton proxy_config design with a more flexible model supporting multiple upstream sources, air-gap mode, and provenance tracking.
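
As a rough illustration of the URL-to-artifact tracking described here, the sketch below derives a fixed-length lookup key from a URL and a content address from the fetched bytes. The function names and the in-memory dictionary are hypothetical stand-ins for the `cached_urls` table, not the actual models.

```python
import hashlib
from typing import Dict, Optional


def url_hash(url: str) -> str:
    """SHA256 hex digest of the URL itself (always 64 hex characters)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def artifact_id(content: bytes) -> str:
    """SHA256 hex digest of the artifact bytes (the content address)."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical stand-in for the cached_urls table: url_hash -> artifact_id.
cached_urls: Dict[str, str] = {}


def record_fetch(url: str, content: bytes) -> str:
    """Remember which artifact a URL resolved to and return its id."""
    aid = artifact_id(content)
    cached_urls[url_hash(url)] = aid
    return aid


def lookup(url: str) -> Optional[str]:
    """Return the artifact id previously recorded for this URL, if any."""
    return cached_urls.get(url_hash(url))
```

Hashing the URL gives a key of constant length regardless of how long the URL is, which is presumably why a separate hash column is indexed rather than the URL text.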

## Acceptance Criteria

- [x] `upstream_sources` table:
  - id (UUID, primary key)
  - name (VARCHAR(255), unique, e.g., "npm-public", "artifactory-private")
  - source_type (VARCHAR(50), enum: npm, pypi, maven, docker, helm, nuget, deb, rpm, generic)
  - url (VARCHAR(2048), base URL of upstream)
  - enabled (BOOLEAN, default false)
  - is_public (BOOLEAN, true if this is a public internet source)
  - auth_type (VARCHAR(20), enum: none, basic, bearer, api_key)
  - username (VARCHAR(255), nullable)
  - password_encrypted (BYTEA, nullable, Fernet encrypted)
  - headers_encrypted (BYTEA, nullable, for custom headers like API keys)
  - priority (INTEGER, default 100, lower = checked first)
  - created_at, updated_at timestamps
- [x] `cache_settings` table (singleton, id always 1):
  - id (INTEGER, primary key, check id = 1)
  - allow_public_internet (BOOLEAN, default true, air-gap kill switch)
  - auto_create_system_projects (BOOLEAN, default true)
  - created_at, updated_at timestamps
- [x] `cached_urls` table:
  - id (UUID, primary key)
  - url (VARCHAR(4096), original URL fetched)
  - url_hash (VARCHAR(64), SHA256 of URL for fast lookup, indexed)
  - artifact_id (VARCHAR(64), FK to artifacts)
  - source_id (UUID, FK to upstream_sources, nullable for manual imports)
  - fetched_at (TIMESTAMP WITH TIME ZONE)
  - response_headers (JSONB, original upstream headers for provenance)
  - created_at timestamp
- [x] Add `is_system` BOOLEAN column to projects table (default false)
- [x] Migration SQL file in migrations/
- [x] Runtime migration in database.py
- [x] SQLAlchemy models for all new tables
- [x] Pydantic schemas for API input/output (passwords write-only)
- [x] Encryption helpers for password/headers fields
- [x] Seed default upstream sources (disabled by default):
  - npm-public: https://registry.npmjs.org
  - pypi-public: https://pypi.org/simple
  - maven-central: https://repo1.maven.org/maven2
  - docker-hub: https://registry-1.docker.io
- [x] Unit tests for models and schemas

## Files Modified

- `migrations/010_upstream_caching.sql`
- `backend/app/database.py` (migrations 016-020)
- `backend/app/models.py` (UpstreamSource, CacheSettings, CachedUrl, Project.is_system)
- `backend/app/schemas.py` (all caching schemas)
- `backend/app/encryption.py` (renamed env var)
- `backend/app/config.py` (renamed setting)
- `backend/tests/test_upstream_caching.py` (37 tests)
- `frontend/src/components/Layout.tsx` (footer tagline)
- `CHANGELOG.md`

---

# Issue #69: HTTP Client - Generic URL Fetcher

**Status: Pending**

## Description

Create a reusable HTTP client for fetching artifacts from upstream sources. Supports multiple auth methods, streaming for large files, and computes SHA256 while downloading.

## Acceptance Criteria

- [ ] `UpstreamClient` class in `backend/app/upstream.py`
- [ ] `fetch(url)` method that:
  - Streams response body (doesn't load large files into memory)
  - Computes SHA256 hash while streaming
  - Returns file content, hash, size, and response headers
- [ ] Auth support based on upstream source configuration:
  - None (anonymous)
  - Basic auth (username/password)
  - Bearer token (Authorization: Bearer {token})
  - API key (custom header name/value)
- [ ] URL-to-source matching:
  - Match URL to configured upstream source by URL prefix
  - Apply auth from matched source
  - Respect source priority for multiple matches
- [ ] Configuration options:
  - Timeout (connect and read, default 30s/300s)
  - Max retries (default 3)
  - Follow redirects (default true, max 5)
  - Max file size (reject if Content-Length exceeds limit)
- [ ] Respect `allow_public_internet` setting:
  - If false, reject URLs matching `is_public=true` sources
  - If false, reject URLs not matching any configured source
- [ ] Capture response headers for provenance tracking
- [ ] Proper error handling:
  - Connection errors (retry with backoff)
  - HTTP errors (4xx, 5xx)
  - Timeout errors
  - SSL/TLS errors
- [ ] Logging for debugging (URL, source matched, status, timing)
- [ ] Unit tests with mocked HTTP responses
- [ ] Integration tests against httpbin.org or similar (optional, marked)

## Technical Notes

- Use `httpx` for async HTTP support (already in requirements)
- Stream to temp file to avoid memory issues with large artifacts
- Consider checksum verification if upstream provides it (e.g., npm provides shasum)

---

# Issue #70: Cache API Endpoint

**Status: Pending**

## Description

API endpoint to cache an artifact from an upstream URL. This is the core endpoint that fetches from upstream, stores in Orchard, and creates appropriate tags.

## Acceptance Criteria

- [ ] `POST /api/v1/cache` endpoint
- [ ] Request body:
  ```json
  {
    "url": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
    "source_type": "npm",
    "package_name": "lodash",
    "tag": "4.17.21",
    "user_project": "my-app",
    "user_package": "npm-deps",
    "user_tag": "lodash-4.17.21",
    "expected_hash": "sha256:abc123..."
  }
  ```
  - `url` (required): URL to fetch
  - `source_type` (required): Determines system project (_npm, _pypi, etc.)
  - `package_name` (optional): Package name in system project, derived from URL if not provided
  - `tag` (optional): Tag name in system project, derived from URL if not provided
  - `user_project`, `user_package`, `user_tag` (optional): Cross-reference in user's project
  - `expected_hash` (optional): Verify downloaded content matches
- [ ] Response:
  ```json
  {
    "artifact_id": "abc123...",
    "sha256": "abc123...",
    "size": 12345,
    "content_type": "application/gzip",
    "already_cached": false,
    "source_url": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
    "source_name": "npm-public",
    "system_project": "_npm",
    "system_package": "lodash",
    "system_tag": "4.17.21",
    "user_reference": "my-app/npm-deps:lodash-4.17.21"
  }
  ```
- [ ] Behavior:
  - Check if URL already cached (by url_hash in cached_urls)
  - If cached: return existing artifact, optionally create user tag
  - If not cached: fetch via UpstreamClient, store artifact, create tags
  - Create/get system project if needed (e.g., `_npm`)
  - Create package in system project (e.g., `_npm/lodash`)
  - Create tag in system project (e.g., `_npm/lodash:4.17.21`)
  - If user reference provided, create tag in user's project
  - Record in cached_urls table with provenance
- [ ] Error handling:
  - 400: Invalid request (bad URL format, missing required fields)
  - 403: Air-gap mode enabled and URL is from public source
  - 404: Upstream returned 404
  - 409: Hash mismatch (if expected_hash provided)
  - 502: Upstream fetch failed (connection error, timeout)
  - 503: Upstream source disabled
- [ ] Authentication required (any authenticated user can cache)
- [ ] Audit logging for cache operations
- [ ] Integration tests covering success and error cases

## Technical Notes

- URL parsing for package_name/tag derivation is format-specific:
  - npm: `/{package}/-/{package}-{version}.tgz` → package=lodash, tag=4.17.21
  - pypi: `/packages/.../requests-2.28.0.tar.gz` → package=requests, tag=2.28.0
  - maven: `/{group}/{artifact}/{version}/{artifact}-{version}.jar`
- Deduplication: if same SHA256 already exists, just create new tag pointing to it

---

# Issue #71: System Projects (Cache Namespaces)

**Status: Pending**

## Description

Implement auto-created system projects for organizing cached artifacts by format type. These are special projects that provide a browsable namespace for all cached upstream packages.
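
To make the namespace idea concrete, here is a minimal sketch of how a source type might map to a system project name and a full tag reference such as `_npm/lodash:4.17.21`. Both helper names are hypothetical; the list of source types is the enum from the `upstream_sources` schema.

```python
# Source types taken from the upstream_sources.source_type enum.
SOURCE_TYPES = (
    "npm", "pypi", "maven", "docker", "helm", "nuget", "deb", "rpm", "generic",
)


def system_project_name(source_type: str) -> str:
    """Map a source type to its system project name, e.g. npm -> _npm."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unknown source type: {source_type!r}")
    return f"_{source_type}"


def system_tag_ref(source_type: str, package: str, tag: str) -> str:
    """Build a project/package:tag reference inside the system project."""
    return f"{system_project_name(source_type)}/{package}:{tag}"
```

The underscore prefix here only produces the display name; a real implementation would still identify system projects by the `is_system` flag.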

## Acceptance Criteria

- [ ] System project names: `_npm`, `_pypi`, `_maven`, `_docker`, `_helm`, `_nuget`, `_deb`, `_rpm`, `_generic`
- [ ] Auto-creation:
  - Created automatically on first cache request for that format
  - Created by cache endpoint, not at startup
  - Uses system user as creator (`created_by = "system"`)
- [ ] System project properties:
  - `is_system = true`
  - `is_public = true` (readable by all authenticated users)
  - `description` = "System cache for {format} packages"
- [ ] Restrictions:
  - Cannot be deleted (return 403 with message)
  - Cannot be renamed
  - Cannot change `is_public` to false
  - Only admins can modify description
- [ ] Helper function: `get_or_create_system_project(source_type)` in routes.py or new cache.py module
- [ ] Update project deletion endpoint to check `is_system` flag
- [ ] Update project update endpoint to enforce restrictions
- [ ] Query helper: list all system projects for UI dropdown
- [ ] Unit tests for restrictions
- [ ] Integration tests for auto-creation and restrictions

## Technical Notes

- System projects are identified by `is_system=true`, not just naming convention
- The `_` prefix is a convention for display purposes
- Packages within system projects follow upstream naming (e.g., `_npm/lodash`, `_npm/@types/node`)

---

# Issue #72: Upstream Sources Admin API

**Status: Pending**

## Description

CRUD API endpoints for managing upstream sources configuration. Admin-only access.
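
Since credentials on upstream sources are write-only, one way to picture the read side of this API is a conversion that drops every secret and exposes only a `has_credentials` flag. The sketch below uses plain dataclasses rather than the project's Pydantic schemas; `StoredSource`, `to_response`, and the rule for what counts as "having credentials" are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StoredSource:
    """Hypothetical subset of an upstream_sources row."""
    name: str
    source_type: str
    url: str
    auth_type: str = "none"
    username: Optional[str] = None
    password_encrypted: Optional[bytes] = None
    headers_encrypted: Optional[bytes] = None


def to_response(source: StoredSource) -> Dict[str, Any]:
    """Build an API response that never contains secret material."""
    return {
        "name": source.name,
        "source_type": source.source_type,
        "url": source.url,
        "auth_type": source.auth_type,
        # Assumption: any stored password or header blob counts as credentials.
        "has_credentials": (
            source.password_encrypted is not None
            or source.headers_encrypted is not None
        ),
    }
```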

## Acceptance Criteria

- [ ] `GET /api/v1/admin/upstream-sources` - List all upstream sources
  - Returns array of sources with id, name, source_type, url, enabled, is_public, auth_type, priority, has_credentials, created_at, updated_at
  - Supports `?enabled=true/false` filter
  - Supports `?source_type=npm,pypi` filter
  - Passwords/tokens never returned
- [ ] `POST /api/v1/admin/upstream-sources` - Create upstream source
  - Request: name, source_type, url, enabled, is_public, auth_type, username, password, headers, priority
  - Validates unique name
  - Validates URL format
  - Encrypts password/headers before storage
  - Returns created source (without secrets)
- [ ] `GET /api/v1/admin/upstream-sources/{id}` - Get source details
  - Returns source with `has_credentials` boolean, not actual credentials
- [ ] `PUT /api/v1/admin/upstream-sources/{id}` - Update source
  - Partial update supported
  - If password provided, re-encrypt; if omitted, keep existing
  - Special value `password: null` clears credentials
- [ ] `DELETE /api/v1/admin/upstream-sources/{id}` - Delete source
  - Returns 400 if source has cached_urls referencing it (optional: cascade or reassign)
- [ ] `POST /api/v1/admin/upstream-sources/{id}/test` - Test connectivity
  - Attempts HEAD request to source URL
  - Returns success/failure with status code and timing
  - Does not cache anything
- [ ] All endpoints require admin role
- [ ] Audit logging for all mutations
- [ ] Pydantic schemas: UpstreamSourceCreate, UpstreamSourceUpdate, UpstreamSourceResponse
- [ ] Integration tests for all endpoints

## Technical Notes

- Test endpoint should respect auth configuration to verify credentials work
- Consider adding `last_used_at` and `last_error` fields for observability (future enhancement)

---

# Issue #73: Global Cache Settings API

**Status: Pending**

## Description

API endpoints for managing global cache settings including air-gap mode.

## Acceptance Criteria

- [ ] `GET /api/v1/admin/cache-settings` - Get current settings
  - Returns: allow_public_internet, auto_create_system_projects, created_at, updated_at
- [ ] `PUT /api/v1/admin/cache-settings` - Update settings
  - Partial update supported
  - Returns updated settings
- [ ] Settings fields:
  - `allow_public_internet` (boolean): When false, blocks all requests to sources marked `is_public=true`
  - `auto_create_system_projects` (boolean): When false, system projects must be created manually
- [ ] Admin-only access
- [ ] Audit logging for changes (especially air-gap mode changes)
- [ ] Pydantic schemas: CacheSettingsResponse, CacheSettingsUpdate
- [ ] Initialize singleton row on first access if not exists
- [ ] Integration tests

## Technical Notes

- Air-gap mode change should be logged prominently (security-relevant)
- Consider requiring confirmation header for disabling air-gap mode (similar to factory reset)

---

# Issue #74: Environment Variable Overrides

**Status: Pending**

## Description

Allow cache and upstream configuration via environment variables for containerized deployments. Environment variables override database settings following 12-factor app principles.

## Acceptance Criteria

- [ ] Global settings overrides:
  - `ORCHARD_CACHE_ALLOW_PUBLIC_INTERNET=true/false`
  - `ORCHARD_CACHE_AUTO_CREATE_SYSTEM_PROJECTS=true/false`
  - `ORCHARD_CACHE_ENCRYPTION_KEY` (Fernet key for credential encryption)
- [ ] Upstream source definition via env vars:
  - `ORCHARD_UPSTREAM__{NAME}__URL` (double underscore as separator)
  - `ORCHARD_UPSTREAM__{NAME}__TYPE` (npm, pypi, maven, etc.)
  - `ORCHARD_UPSTREAM__{NAME}__ENABLED` (true/false)
  - `ORCHARD_UPSTREAM__{NAME}__IS_PUBLIC` (true/false)
  - `ORCHARD_UPSTREAM__{NAME}__AUTH_TYPE` (none, basic, bearer, api_key)
  - `ORCHARD_UPSTREAM__{NAME}__USERNAME`
  - `ORCHARD_UPSTREAM__{NAME}__PASSWORD`
  - `ORCHARD_UPSTREAM__{NAME}__PRIORITY`
  - Example: `ORCHARD_UPSTREAM__NPM_PRIVATE__URL=https://npm.corp.com`
- [ ] Env var sources:
  - Loaded at startup
  - Merged with database sources
  - Env var sources have `source = "env"` marker
  - Cannot be modified via API (return 400)
  - Cannot be deleted via API (return 400)
- [ ] Update Settings class in config.py
- [ ] Update get/list endpoints to include env-defined sources
- [ ] Document all env vars in CLAUDE.md
- [ ] Unit tests for env var parsing
- [ ] Integration tests with env vars set

## Technical Notes

- Double underscore (`__`) separator allows source names with single underscores
- Env-defined sources should appear in API responses but marked as read-only
- Consider startup validation that warns about invalid env var combinations

---

# Issue #75: Frontend - Upstream Sources Management

**Status: Pending**

## Description

Admin UI for managing upstream sources and cache settings.

## Acceptance Criteria

- [ ] New admin page: `/admin/cache` or `/admin/upstream-sources`
- [ ] Upstream sources section:
  - Table listing all sources with: name, type, URL, enabled toggle, public badge, priority, actions
  - Visual distinction for env-defined sources (locked icon, no edit/delete)
  - Create button opens modal/form
  - Edit button for DB-defined sources
  - Delete with confirmation modal
  - Test connection button with status indicator
- [ ] Create/edit form fields:
  - Name (text, required)
  - Source type (dropdown)
  - URL (text, required)
  - Priority (number)
  - Is public (checkbox)
  - Enabled (checkbox)
  - Auth type (dropdown: none, basic, bearer, api_key)
  - Conditional auth fields based on type:
    - Basic: username, password
    - Bearer: token
    - API key: header name, header value
  - Password fields masked, "unchanged" placeholder on edit
- [ ] Cache settings section:
  - Air-gap mode toggle with warning
  - Auto-create system projects toggle
  - "Air-gap mode" shows prominent warning banner when enabled
- [ ] Link from main admin navigation
- [ ] Loading and error states
- [ ] Success/error toast notifications

## Technical Notes

- Use existing admin page patterns from user management
- Air-gap toggle should require confirmation (modal with warning text)

---

# Issue #105: Frontend - System Projects Integration

**Status: Pending**

## Description

Integrate system projects into the frontend UI with appropriate visual treatment and navigation.

## Acceptance Criteria

- [ ] Home page project dropdown:
  - System projects shown in separate "Cached Packages" section
  - Visual distinction (icon, different background, or badge)
  - Format icon for each type (npm, pypi, maven, etc.)
- [ ] Project list/grid:
  - System projects can be filtered: "Show system projects" toggle
  - Or separate tab: "Projects" | "Package Cache"
- [ ] System project page:
  - "System Cache" badge in header
  - Description explains this is auto-managed cache
  - Settings/delete buttons hidden or disabled
  - Shows format type prominently
- [ ] Package page within system project:
  - Shows "Cached from" with source URL (linked)
  - Shows "First cached" timestamp
  - Shows which upstream source provided it
- [ ] Artifact page:
  - If artifact came from cache, show provenance:
    - Original URL
    - Upstream source name
    - Fetch timestamp
- [ ] Search includes system projects (with filter option)

## Technical Notes

- Use React context or query params for system project filtering
- Consider dedicated route: `/cache/npm/lodash` as alias for `/_npm/lodash`

---

# Issue #77: CLI - Cache Command

**Status: Pending**

## Description

Add a new `orchard cache` command to the existing CLI for caching artifacts from upstream URLs. This integrates with the new cache API endpoint and can optionally update `orchard.ensure` with cached artifacts.

## Acceptance Criteria

- [ ] New command: `orchard cache ` in `orchard/commands/cache.py`
- [ ] Basic usage:
  ```bash
  # Cache a URL, print artifact info
  orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz

  # Output:
  # Caching https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz...
  # Source type: npm
  # Package: lodash
  # Version: 4.17.21
  #
  # Successfully cached artifact
  # Artifact ID: abc123...
  # Size: 1.2 MB
  # System project: _npm
  # System package: lodash
  # System tag: 4.17.21
  ```
- [ ] Options:

  | Option | Description |
  |--------|-------------|
  | `--type, -t TYPE` | Source type: npm, pypi, maven, docker, helm, generic (auto-detected from URL if not provided) |
  | `--package, -p NAME` | Package name in system project (auto-derived from URL if not provided) |
  | `--tag TAG` | Tag name in system project (auto-derived from URL if not provided) |
  | `--project PROJECT` | Also create tag in this user project |
  | `--user-package PKG` | Package name in user project (required if --project specified) |
  | `--user-tag TAG` | Tag name in user project (default: same as system tag) |
  | `--expected-hash HASH` | Verify downloaded content matches this SHA256 |
  | `--add` | Add to orchard.ensure after caching |
  | `--add-path PATH` | Extraction path for --add (default: `/`) |
  | `--file, -f FILE` | Path to orchard.ensure file |
  | `--verbose, -v` | Show detailed output |

- [ ] URL type auto-detection:
  - `registry.npmjs.org` → npm
  - `pypi.org` or `files.pythonhosted.org` → pypi
  - `repo1.maven.org` or contains `/maven2/` → maven
  - `registry-1.docker.io` or `docker.io` → docker
  - Otherwise → generic
- [ ] Package/version extraction from URL patterns:
  - npm: `/{package}/-/{package}-{version}.tgz`
  - pypi: `/packages/.../requests-{version}.tar.gz`
  - maven: `/{group}/{artifact}/{version}/{artifact}-{version}.jar`
- [ ] Add `cache_artifact()` function to `orchard/api.py`
- [ ] Integration with `--add` flag:
  - Parse existing orchard.ensure
  - Add new dependency entry pointing to cached artifact
  - Use artifact_id (SHA256) for hermetic pinning
- [ ] Batch mode: `orchard cache --file urls.txt`
  - One URL per line
  - Lines starting with `#` are comments
  - Report success/failure for each
- [ ] Exit codes:
  - 0: Success (or already cached)
  - 1: Fetch failed
  - 2: Hash mismatch
  - 3: Air-gap mode blocked request
- [ ] Error handling consistent with existing CLI patterns
- [ ] Unit tests in `test/test_cache.py`
- [ ] Update README.md with cache command documentation

## Technical Notes

- Follow existing Click patterns from other commands
- Use `get_auth_headers()` from `orchard/auth.py`
- URL parsing can use `urllib.parse`
- Consider adding URL pattern registry for extensibility
- The `--add` flag should integrate with existing ensure file parsing in `orchard/ensure.py`

## Example Workflows

```bash
# Simple: cache a single URL
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz

# Cache and add to orchard.ensure for current project
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz \
  --add --add-path libs/lodash/

# Cache with explicit metadata
orchard cache https://internal.corp/files/custom-lib.tar.gz \
  --type generic \
  --package custom-lib \
  --tag v1.0.0

# Cache and cross-reference to user project
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz \
  --project my-app \
  --user-package npm-deps \
  --user-tag lodash-4.17.21

# Batch cache from file
orchard cache --file deps-urls.txt

# Verify hash while caching
orchard cache https://example.com/file.tar.gz \
  --expected-hash sha256:abc123...
```

---

## Out of Scope (Future Enhancements)

- Automatic transitive dependency resolution (client's responsibility)
- Lockfile parsing (`package-lock.json`, `requirements.txt`) - stretch goal for CLI
- Cache eviction policies (we cache forever by design)
- Mirroring/sync between Orchard instances
- Format-specific metadata extraction (npm package.json parsing, etc.)

## Success Criteria

- [ ] Can cache any URL and retrieve by SHA256 hash
- [ ] Cached artifacts persist indefinitely
- [ ] Air-gap mode blocks all public internet access
- [ ] Multiple upstream sources with different auth
- [ ] System projects organize cached packages by format
- [ ] CLI can cache URLs and update orchard.ensure
- [ ] Admin UI for upstream source management
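
As an illustration of the URL type auto-detection and npm package/version extraction rules listed under Issue #77, the following sketch uses only `urllib.parse` and `re`. The function names and the regular expression are assumptions, and only the npm pattern is implemented; the pypi and maven patterns would need their own parsers.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse


def detect_source_type(url: str) -> str:
    """Guess the source type from the URL host/path, falling back to generic."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host == "registry.npmjs.org":
        return "npm"
    if host in ("pypi.org", "files.pythonhosted.org"):
        return "pypi"
    if host == "repo1.maven.org" or "/maven2/" in parsed.path:
        return "maven"
    if host in ("registry-1.docker.io", "docker.io"):
        return "docker"
    return "generic"


# Matches /{package}/-/{file}.tgz, allowing an optional @scope/ prefix.
_NPM_TARBALL = re.compile(r"/(?P<package>(?:@[^/]+/)?[^/]+)/-/(?P<file>[^/]+)\.tgz$")


def parse_npm_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (package, version) for an npm tarball URL, or None if no match."""
    match = _NPM_TARBALL.search(urlparse(url).path)
    if match is None:
        return None
    package = match.group("package")
    # The tarball file name uses the unscoped package name as its prefix.
    prefix = package.split("/")[-1] + "-"
    filename = match.group("file")
    if not filename.startswith(prefix):
        return None
    return package, filename[len(prefix):]
```

Returning `None` instead of guessing leaves room for the CLI to fall back to explicit `--package` and `--tag` values when a URL does not fit the pattern.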