Epic: Upstream Artifact Caching for Hermetic Builds
Overview
Orchard will act as a permanent, content-addressable cache for upstream artifacts (npm, PyPI, Maven, Docker, etc.). Once an artifact is cached, it is stored forever by SHA256 hash - enabling reproducible builds years later regardless of whether the upstream source still exists.
Problem Statement
Build reproducibility is critical for enterprise environments:
- Packages get deleted, yanked, or modified upstream
- Registries go down or change URLs
- Version constraints resolve differently over time
- Air-gapped environments cannot access public internet
Teams need to guarantee that a build from 5 years ago produces the exact same output today.
Solution
Orchard becomes "the cache that never forgets":
- Fetch once, store forever - When a build needs lodash@4.17.21, Orchard fetches it from npm, stores it by SHA256 hash, and never deletes it
- Content-addressable - Same hash = same bytes, guaranteed
- Format-agnostic - Orchard doesn't need to understand npm/PyPI/Maven protocols; the client provides the URL, Orchard fetches and stores
- Air-gap support - Disable public internet entirely, only allow configured private upstreams
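To make the "same hash = same bytes" property concrete, here is a minimal sketch of a content-addressable store. It is not Orchard's implementation: the `ContentStore` class is hypothetical, and a local directory stands in for the S3 bucket mentioned in the architecture.

```python
import hashlib
import tempfile
from pathlib import Path


class ContentStore:
    """Toy content-addressable store; a local directory stands in for S3."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store bytes under their SHA256 digest and return the digest."""
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():  # existing blobs are never overwritten or deleted
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the bytes for a digest, re-verifying them on the way out."""
        data = (self.root / digest).read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"stored content does not match {digest}")
        return data


store = ContentStore(Path(tempfile.mkdtemp()))
first = store.put(b"example tarball bytes")
second = store.put(b"example tarball bytes")  # same bytes map to the same key
```

Because the key is derived from the content, storing the same artifact twice is a no-op, and a reader can always check that what it received matches the hash it asked for.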
User Workflow
1. Build tool resolves dependencies    npm install / pip install / mvn resolve
        ↓
2. Generate lockfile with URLs         package-lock.json / requirements.txt
        ↓
3. Cache all URLs in Orchard           orchard cache --file urls.txt
        ↓
4. Pin by SHA256 hash                  lodash = "sha256:abc123..."
        ↓
5. Future builds fetch by hash         Always get exact same bytes
Key Features
- Multiple upstream sources - Configure npm, PyPI, Maven Central, private Artifactory, etc.
- Per-source authentication - Basic auth, bearer tokens, API keys
- System cache projects - _npm, _pypi, _maven organize cached packages by format
- Cross-referencing - Link cached artifacts to user projects for visibility
- URL tracking - Know which URLs map to which hashes, audit provenance
- Air-gap mode - Global kill switch for all public internet access
- Environment variable config - 12-factor friendly for containerized deployments
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Orchard Server │
├─────────────────────────────────────────────────────────────────┤
│ POST /api/v1/cache │
│ ├── Check if URL already cached (url_hash lookup) │
│ ├── Match URL to upstream source (get auth) │
│ ├── Fetch via UpstreamClient (stream + compute SHA256) │
│ ├── Store artifact in S3 (content-addressable) │
│ ├── Create tag in system project (_npm/lodash:4.17.21) │
│ ├── Optionally create tag in user project │
│ └── Record in cached_urls table (provenance) │
├─────────────────────────────────────────────────────────────────┤
│ Tables │
│ ├── upstream_sources (npm-public, pypi-public, artifactory) │
│ ├── cache_settings (allow_public_internet, etc.) │
│ ├── cached_urls (url → artifact_id mapping) │
│ └── projects.is_system (for _npm, _pypi, etc.) │
└─────────────────────────────────────────────────────────────────┘
Issues Summary
| Issue | Title | Status | Dependencies |
|---|---|---|---|
| #68 | Schema: Upstream Sources & Cache Tracking | ✅ Complete | None |
| #69 | HTTP Client: Generic URL Fetcher | Pending | None |
| #70 | Cache API Endpoint | Pending | #68, #69 |
| #71 | System Projects (Cache Namespaces) | Pending | #68, #70 |
| #72 | Upstream Sources Admin API | Pending | #68 |
| #73 | Global Cache Settings API | Pending | #68 |
| #74 | Environment Variable Overrides | Pending | #68, #72, #73 |
| #75 | Frontend: Upstream Sources Management | Pending | #72, #73 |
| #105 | Frontend: System Projects Integration | Pending | #71 |
| #77 | CLI: Cache Command | Pending | #70 |
Implementation Phases
Phase 1 - Core (MVP):
- #68 Schema ✅
- #69 HTTP Client
- #70 Cache API
- #71 System Projects
Phase 2 - Admin:
- #72 Upstream Sources API
- #73 Cache Settings API
- #74 Environment Variables
Phase 3 - Frontend:
- #75 Upstream Sources UI
- #105 System Projects UI
Phase 4 - CLI:
- #77 Cache Command
Issue #68: Schema - Upstream Sources & Cache Tracking
Status: ✅ Complete
Description
Create database schema for flexible multi-source upstream configuration and URL-to-artifact tracking. This replaces the previous singleton proxy_config design with a more flexible model supporting multiple upstream sources, air-gap mode, and provenance tracking.
Acceptance Criteria
- upstream_sources table:
- id (UUID, primary key)
- name (VARCHAR(255), unique, e.g., "npm-public", "artifactory-private")
- source_type (VARCHAR(50), enum: npm, pypi, maven, docker, helm, nuget, deb, rpm, generic)
- url (VARCHAR(2048), base URL of upstream)
- enabled (BOOLEAN, default false)
- is_public (BOOLEAN, true if this is a public internet source)
- auth_type (VARCHAR(20), enum: none, basic, bearer, api_key)
- username (VARCHAR(255), nullable)
- password_encrypted (BYTEA, nullable, Fernet encrypted)
- headers_encrypted (BYTEA, nullable, for custom headers like API keys)
- priority (INTEGER, default 100, lower = checked first)
- created_at, updated_at timestamps
- cache_settings table (singleton, id always 1):
- id (INTEGER, primary key, check id = 1)
- allow_public_internet (BOOLEAN, default true, air-gap kill switch)
- auto_create_system_projects (BOOLEAN, default true)
- created_at, updated_at timestamps
- cached_urls table:
- id (UUID, primary key)
- url (VARCHAR(4096), original URL fetched)
- url_hash (VARCHAR(64), SHA256 of URL for fast lookup, indexed)
- artifact_id (VARCHAR(64), FK to artifacts)
- source_id (UUID, FK to upstream_sources, nullable for manual imports)
- fetched_at (TIMESTAMP WITH TIME ZONE)
- response_headers (JSONB, original upstream headers for provenance)
- created_at timestamp
- Add is_system BOOLEAN column to projects table (default false)
- Migration SQL file in migrations/
- Runtime migration in database.py
- SQLAlchemy models for all new tables
- Pydantic schemas for API input/output (passwords write-only)
- Encryption helpers for password/headers fields
- Seed default upstream sources (disabled by default):
- npm-public: https://registry.npmjs.org
- pypi-public: https://pypi.org/simple
- maven-central: https://repo1.maven.org/maven2
- docker-hub: https://registry-1.docker.io
- Unit tests for models and schemas
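Two parts of this schema are easy to get subtly wrong: the singleton constraint on cache_settings and the url_hash lookup on cached_urls. The sketch below illustrates both using SQLite from the Python standard library as a stand-in for the real database. The column types are simplified (for example, an integer id instead of a UUID), and the helper names `url_hash`, `record`, `lookup`, and `second_settings_row_rejected` are made up for illustration.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cache_settings (
        id INTEGER PRIMARY KEY CHECK (id = 1),  -- singleton row
        allow_public_internet INTEGER NOT NULL DEFAULT 1,
        auto_create_system_projects INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE cached_urls (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        url_hash TEXT NOT NULL,      -- SHA256 of the URL, hex encoded
        artifact_id TEXT NOT NULL    -- SHA256 of the content
    );
    CREATE INDEX idx_cached_urls_url_hash ON cached_urls (url_hash);
    INSERT INTO cache_settings (id) VALUES (1);
    """
)


def url_hash(url: str) -> str:
    """Fixed-width lookup key for URLs of arbitrary length."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def record(url: str, artifact_id: str) -> None:
    conn.execute(
        "INSERT INTO cached_urls (url, url_hash, artifact_id) VALUES (?, ?, ?)",
        (url, url_hash(url), artifact_id),
    )


def lookup(url: str):
    """Return the artifact_id cached for this URL, or None."""
    row = conn.execute(
        "SELECT artifact_id FROM cached_urls WHERE url_hash = ?", (url_hash(url),)
    ).fetchone()
    return row[0] if row else None


def second_settings_row_rejected() -> bool:
    """The CHECK constraint should make a second settings row impossible."""
    try:
        conn.execute("INSERT INTO cache_settings (id) VALUES (2)")
    except sqlite3.IntegrityError:
        return True
    return False
```

Indexing a 64-character hash rather than the URL itself keeps the index compact even though URLs may be up to 4096 characters long.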
Files Modified
- migrations/010_upstream_caching.sql
- backend/app/database.py (migrations 016-020)
- backend/app/models.py (UpstreamSource, CacheSettings, CachedUrl, Project.is_system)
- backend/app/schemas.py (all caching schemas)
- backend/app/encryption.py (renamed env var)
- backend/app/config.py (renamed setting)
- backend/tests/test_upstream_caching.py (37 tests)
- frontend/src/components/Layout.tsx (footer tagline)
- CHANGELOG.md
Issue #69: HTTP Client - Generic URL Fetcher
Status: Pending
Description
Create a reusable HTTP client for fetching artifacts from upstream sources. Supports multiple auth methods, streaming for large files, and computes SHA256 while downloading.
Acceptance Criteria
- UpstreamClient class in backend/app/upstream.py
- fetch(url) method that:
- Streams response body (doesn't load large files into memory)
- Computes SHA256 hash while streaming
- Returns file content, hash, size, and response headers
- Auth support based on upstream source configuration:
- None (anonymous)
- Basic auth (username/password)
- Bearer token (Authorization: Bearer {token})
- API key (custom header name/value)
- URL-to-source matching:
- Match URL to configured upstream source by URL prefix
- Apply auth from matched source
- Respect source priority for multiple matches
- Configuration options:
- Timeout (connect and read, default 30s/300s)
- Max retries (default 3)
- Follow redirects (default true, max 5)
- Max file size (reject if Content-Length exceeds limit)
- Respect allow_public_internet setting:
- If false, reject URLs matching is_public=true sources
- If false, reject URLs not matching any configured source
- Capture response headers for provenance tracking
- Proper error handling:
- Connection errors (retry with backoff)
- HTTP errors (4xx, 5xx)
- Timeout errors
- SSL/TLS errors
- Logging for debugging (URL, source matched, status, timing)
- Unit tests with mocked HTTP responses
- Integration tests against httpbin.org or similar (optional, marked)
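The URL-to-source matching and air-gap rules above could be sketched as follows. This is an illustration under assumptions, not the planned UpstreamClient code: the `Source` dataclass, `match_source`, `resolve_source`, and `AirGapBlocked` are hypothetical names, and the example sources are invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Source:
    name: str
    url: str             # base URL of the upstream
    is_public: bool
    priority: int = 100  # lower = checked first


class AirGapBlocked(Exception):
    """Raised when the air-gap rules forbid fetching a URL."""


def match_source(url: str, sources: Sequence[Source]) -> Optional[Source]:
    """Return the matching source with the lowest priority number, if any."""
    # Compare against "base/" so https://npm.corp.com does not also match
    # a look-alike host such as https://npm.corp.com.evil.example.
    matches = [s for s in sources if url.startswith(s.url.rstrip("/") + "/")]
    return min(matches, key=lambda s: s.priority, default=None)


def resolve_source(
    url: str, sources: Sequence[Source], allow_public_internet: bool
) -> Optional[Source]:
    """Pick the source for a URL, enforcing air-gap mode when it is active."""
    source = match_source(url, sources)
    if not allow_public_internet:
        if source is None:
            raise AirGapBlocked(f"no configured source matches {url}")
        if source.is_public:
            raise AirGapBlocked(f"{source.name} is a public source")
    return source


SOURCES = [
    Source("npm-public", "https://registry.npmjs.org", is_public=True),
    Source("npm-private", "https://npm.corp.com", is_public=False, priority=10),
]
```

With public internet allowed, a URL that matches no source resolves to `None`, which a caller could treat as an anonymous fetch. In air-gap mode the same URL is rejected.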
Technical Notes
- Use httpx for async HTTP support (already in requirements)
- Stream to temp file to avoid memory issues with large artifacts
- Consider checksum verification if upstream provides it (e.g., npm provides shasum)
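Streaming to a temp file while hashing might look roughly like the sketch below. It takes any iterable of byte chunks, so it can be exercised without a network. With httpx the chunks would typically come from `response.iter_bytes()` on a streamed response. The function name `spool_and_hash` and the `FileTooLarge` exception are assumptions, and enforcing the size limit during the download (rather than only from Content-Length) is an extra safeguard that is not stated in the criteria.

```python
import hashlib
import tempfile
from typing import IO, Iterable, Optional, Tuple


class FileTooLarge(Exception):
    """Raised when the body grows past the configured limit."""


def spool_and_hash(
    chunks: Iterable[bytes], max_size: Optional[int] = None
) -> Tuple[IO[bytes], str, int]:
    """Write chunks to a temp file, returning (file, sha256 hex, size)."""
    hasher = hashlib.sha256()
    size = 0
    tmp = tempfile.TemporaryFile()
    try:
        for chunk in chunks:
            size += len(chunk)
            if max_size is not None and size > max_size:
                raise FileTooLarge(f"body exceeds {max_size} bytes")
            hasher.update(chunk)  # hash incrementally, one chunk at a time
            tmp.write(chunk)
    except BaseException:
        tmp.close()  # do not leak the temp file on failure
        raise
    tmp.seek(0)  # rewind so the caller can upload the content
    return tmp, hasher.hexdigest(), size
```

Only one chunk is held in memory at a time, so the peak memory use is independent of the artifact size.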
Issue #70: Cache API Endpoint
Status: Pending
Description
API endpoint to cache an artifact from an upstream URL. This is the core endpoint that fetches from upstream, stores in Orchard, and creates appropriate tags.
Acceptance Criteria
- POST /api/v1/cache endpoint
- Request body:
{
  "url": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
  "source_type": "npm",
  "package_name": "lodash",
  "tag": "4.17.21",
  "user_project": "my-app",
  "user_package": "npm-deps",
  "user_tag": "lodash-4.17.21",
  "expected_hash": "sha256:abc123..."
}
- url (required): URL to fetch
- source_type (required): Determines system project (_npm, _pypi, etc.)
- package_name (optional): Package name in system project, derived from URL if not provided
- tag (optional): Tag name in system project, derived from URL if not provided
- user_project, user_package, user_tag (optional): Cross-reference in user's project
- expected_hash (optional): Verify downloaded content matches
- Response:
{
  "artifact_id": "abc123...",
  "sha256": "abc123...",
  "size": 12345,
  "content_type": "application/gzip",
  "already_cached": false,
  "source_url": "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
  "source_name": "npm-public",
  "system_project": "_npm",
  "system_package": "lodash",
  "system_tag": "4.17.21",
  "user_reference": "my-app/npm-deps:lodash-4.17.21"
}
- Behavior:
- Check if URL already cached (by url_hash in cached_urls)
- If cached: return existing artifact, optionally create user tag
- If not cached: fetch via UpstreamClient, store artifact, create tags
- Create/get system project if needed (e.g., _npm)
- Create package in system project (e.g., _npm/lodash)
- Create tag in system project (e.g., _npm/lodash:4.17.21)
- If user reference provided, create tag in user's project
- Record in cached_urls table with provenance
- Error handling:
- 400: Invalid request (bad URL format, missing required fields)
- 403: Air-gap mode enabled and URL is from public source
- 404: Upstream returned 404
- 409: Hash mismatch (if expected_hash provided)
- 502: Upstream fetch failed (connection error, timeout)
- 503: Upstream source disabled
- Authentication required (any authenticated user can cache)
- Audit logging for cache operations
- Integration tests covering success and error cases
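The core of the behavior above (look up by url_hash, fetch on a miss, verify expected_hash, then record and tag) can be sketched with in-memory dictionaries. This is a simplified model rather than the endpoint itself: `CacheService` and `HashMismatch` are hypothetical names, the `fetch` callable stands in for UpstreamClient, and authentication, system projects, and audit logging are left out.

```python
import hashlib
from typing import Callable, Dict, Optional


class HashMismatch(Exception):
    """The endpoint would translate this into HTTP 409."""


class CacheService:
    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch                     # stand-in for UpstreamClient
        self.artifacts: Dict[str, bytes] = {}  # sha256 -> content
        self.cached_urls: Dict[str, str] = {}  # url_hash -> sha256
        self.tags: Dict[str, str] = {}         # "project/package:tag" -> sha256

    def cache(
        self, url: str, tag_ref: str, expected_hash: Optional[str] = None
    ) -> dict:
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        sha256 = self.cached_urls.get(key)
        already_cached = sha256 is not None
        data = b""
        if not already_cached:
            data = self.fetch(url)
            sha256 = hashlib.sha256(data).hexdigest()
        # Accept both "sha256:<hex>" and bare hex for expected_hash.
        if expected_hash is not None and expected_hash.split(":", 1)[-1] != sha256:
            raise HashMismatch(f"expected {expected_hash}, got sha256:{sha256}")
        if not already_cached:
            # Verification passed, so it is now safe to persist anything.
            self.artifacts.setdefault(sha256, data)  # dedupe identical content
            self.cached_urls[key] = sha256
        self.tags[tag_ref] = sha256
        return {"artifact_id": sha256, "already_cached": already_cached}
```

Checking the hash before writing means a mismatch leaves no artifact, URL record, or tag behind.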
Technical Notes
- URL parsing for package_name/tag derivation is format-specific:
- npm: /{package}/-/{package}-{version}.tgz → package=lodash, tag=4.17.21
- pypi: /packages/.../requests-2.28.0.tar.gz → package=requests, tag=2.28.0
- maven: /{group}/{artifact}/{version}/{artifact}-{version}.jar
- Deduplication: if same SHA256 already exists, just create new tag pointing to it
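The format-specific derivation of package_name and tag could be sketched as one small parser per format. The function names are hypothetical, and several choices here are assumptions rather than decisions from this epic: the PyPI parser handles only .tar.gz sdists, the Maven parser returns the artifact name without the group, and scoped npm packages are assumed to appear unencoded in the URL path.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Optional "@scope/" prefix, then the package, then "/-/<tarball>.tgz".
_NPM_PATH = re.compile(r"^/(?P<package>(?:@[^/]+/)?[^/]+)/-/(?P<stem>[^/]+)\.tgz$")


def parse_npm(path: str) -> Optional[Tuple[str, str]]:
    match = _NPM_PATH.match(path)
    if match is None:
        return None
    package = match.group("package")
    prefix = package.rsplit("/", 1)[-1] + "-"  # tarball name omits the scope
    stem = match.group("stem")
    if not stem.startswith(prefix):
        return None
    return package, stem[len(prefix):]


def parse_pypi_sdist(path: str) -> Optional[Tuple[str, str]]:
    filename = path.rsplit("/", 1)[-1]
    if not filename.endswith(".tar.gz"):
        return None
    # Split on the last hyphen: "requests-2.28.0" -> ("requests", "2.28.0").
    name, sep, version = filename[: -len(".tar.gz")].rpartition("-")
    return (name, version) if sep and name and version else None


def parse_maven(path: str) -> Optional[Tuple[str, str]]:
    parts = path.strip("/").split("/")
    if len(parts) < 4:
        return None
    artifact, version, filename = parts[-3], parts[-2], parts[-1]
    if filename != f"{artifact}-{version}.jar":
        return None
    return artifact, version


_PARSERS = {"npm": parse_npm, "pypi": parse_pypi_sdist, "maven": parse_maven}


def derive_package_and_tag(source_type: str, url: str) -> Optional[Tuple[str, str]]:
    """Return (package_name, tag) or None when the URL does not fit the pattern."""
    parser = _PARSERS.get(source_type)
    return parser(urlparse(url).path) if parser else None
```

Returning `None` instead of guessing lets the endpoint respond with a 400 and ask the caller to supply package_name and tag explicitly.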
Issue #71: System Projects (Cache Namespaces)
Status: Pending
Description
Implement auto-created system projects for organizing cached artifacts by format type. These are special projects that provide a browsable namespace for all cached upstream packages.
Acceptance Criteria
- System project names: _npm, _pypi, _maven, _docker, _helm, _nuget, _deb, _rpm, _generic
- Auto-creation:
- Created automatically on first cache request for that format
- Created by cache endpoint, not at startup
- Uses system user as creator (created_by = "system")
- System project properties:
- is_system = true
- is_public = true (readable by all authenticated users)
- description = "System cache for {format} packages"
- Restrictions:
- Cannot be deleted (return 403 with message)
- Cannot be renamed
- Cannot change is_public to false
- Only admins can modify description
- Helper function: get_or_create_system_project(source_type) in routes.py or new cache.py module
- Update project deletion endpoint to check is_system flag
- Update project update endpoint to enforce restrictions
- Query helper: list all system projects for UI dropdown
- Unit tests for restrictions
- Integration tests for auto-creation and restrictions
Technical Notes
- System projects are identified by is_system=true, not just naming convention
- The _ prefix is a convention for display purposes
- Packages within system projects follow upstream naming (e.g., _npm/lodash, _npm/@types/node)
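A rough sketch of the get-or-create helper and the deletion guard is shown below, with an in-memory dictionary standing in for the projects table. The `Project` dataclass, the `Forbidden` exception, and the `delete_project` function are illustrative assumptions. Only `get_or_create_system_project(source_type)` is named in this issue, and its real signature may differ.

```python
from dataclasses import dataclass
from typing import Dict

SYSTEM_TYPES = {
    "npm", "pypi", "maven", "docker", "helm", "nuget", "deb", "rpm", "generic",
}


@dataclass
class Project:
    name: str
    description: str = ""
    is_system: bool = False
    is_public: bool = False
    created_by: str = ""


class Forbidden(Exception):
    """Would map to HTTP 403 in the API layer."""


def get_or_create_system_project(
    projects: Dict[str, Project], source_type: str
) -> Project:
    """Return the system project for a format, creating it on first use."""
    if source_type not in SYSTEM_TYPES:
        raise ValueError(f"unknown source type: {source_type}")
    name = f"_{source_type}"
    project = projects.get(name)
    if project is None:
        project = Project(
            name=name,
            description=f"System cache for {source_type} packages",
            is_system=True,
            is_public=True,
            created_by="system",
        )
        projects[name] = project
    return project


def delete_project(projects: Dict[str, Project], name: str) -> None:
    # Check the flag, not the "_" prefix: the prefix is only a display convention.
    if projects[name].is_system:
        raise Forbidden("System projects cannot be deleted")
    del projects[name]
```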
Issue #72: Upstream Sources Admin API
Status: Pending
Description
CRUD API endpoints for managing upstream sources configuration. Admin-only access.
Acceptance Criteria
- GET /api/v1/admin/upstream-sources - List all upstream sources
- Returns array of sources with id, name, source_type, url, enabled, is_public, auth_type, priority, has_credentials, created_at, updated_at
- Supports ?enabled=true/false filter
- Supports ?source_type=npm,pypi filter
- Passwords/tokens never returned
- POST /api/v1/admin/upstream-sources - Create upstream source
- Request: name, source_type, url, enabled, is_public, auth_type, username, password, headers, priority
- Validates unique name
- Validates URL format
- Encrypts password/headers before storage
- Returns created source (without secrets)
- GET /api/v1/admin/upstream-sources/{id} - Get source details
- Returns source with has_credentials boolean, not actual credentials
- PUT /api/v1/admin/upstream-sources/{id} - Update source
- Partial update supported
- If password provided, re-encrypt; if omitted, keep existing
- Special value password: null clears credentials
- DELETE /api/v1/admin/upstream-sources/{id} - Delete source
- Returns 400 if source has cached_urls referencing it (optional: cascade or reassign)
- POST /api/v1/admin/upstream-sources/{id}/test - Test connectivity
- Attempts HEAD request to source URL
- Returns success/failure with status code and timing
- Does not cache anything
- All endpoints require admin role
- Audit logging for all mutations
- Pydantic schemas: UpstreamSourceCreate, UpstreamSourceUpdate, UpstreamSourceResponse
- Integration tests for all endpoints
Technical Notes
- Test endpoint should respect auth configuration to verify credentials work
- Consider adding last_used_at and last_error fields for observability (future enhancement)
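The PUT semantics hinge on telling "password omitted" apart from "password: null". One way to sketch that distinction is to operate on the parsed JSON body as a plain dictionary, as below. The names `StoredSource`, `apply_update`, `to_response`, and `placeholder_encrypt` are hypothetical, and `placeholder_encrypt` is deliberately NOT real encryption (the epic specifies Fernet). With Pydantic, dumping the update model with `exclude_unset=True` gives a comparable omitted-versus-null distinction.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StoredSource:
    name: str
    url: str
    password_encrypted: Optional[bytes] = None


def placeholder_encrypt(value: str) -> bytes:
    """Placeholder only, NOT encryption; the real code would use Fernet."""
    return b"enc:" + value.encode("utf-8")


def apply_update(source: StoredSource, patch: Dict[str, Any]) -> None:
    """Apply a partial update; keys absent from patch are left untouched."""
    for field in ("name", "url"):
        if field in patch:
            setattr(source, field, patch[field])
    if "password" in patch:  # key present: either a new value or an explicit null
        password = patch["password"]
        source.password_encrypted = (
            None if password is None else placeholder_encrypt(password)
        )


def to_response(source: StoredSource) -> Dict[str, Any]:
    """Expose only whether credentials exist, never the credentials."""
    return {
        "name": source.name,
        "url": source.url,
        "has_credentials": source.password_encrypted is not None,
    }
```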
Issue #73: Global Cache Settings API
Status: Pending
Description
API endpoints for managing global cache settings including air-gap mode.
Acceptance Criteria
- GET /api/v1/admin/cache-settings - Get current settings
- Returns: allow_public_internet, auto_create_system_projects, created_at, updated_at
- PUT /api/v1/admin/cache-settings - Update settings
- Partial update supported
- Returns updated settings
- Settings fields:
- allow_public_internet (boolean): When false, blocks all requests to sources marked is_public=true
- auto_create_system_projects (boolean): When false, system projects must be created manually
- Admin-only access
- Audit logging for changes (especially air-gap mode changes)
- Pydantic schemas: CacheSettingsResponse, CacheSettingsUpdate
- Initialize singleton row on first access if not exists
- Integration tests
Technical Notes
- Air-gap mode change should be logged prominently (security-relevant)
- Consider requiring confirmation header for disabling air-gap mode (similar to factory reset)
Issue #74: Environment Variable Overrides
Status: Pending
Description
Allow cache and upstream configuration via environment variables for containerized deployments. Environment variables override database settings following 12-factor app principles.
Acceptance Criteria
- Global settings overrides:
- ORCHARD_CACHE_ALLOW_PUBLIC_INTERNET=true/false
- ORCHARD_CACHE_AUTO_CREATE_SYSTEM_PROJECTS=true/false
- ORCHARD_CACHE_ENCRYPTION_KEY (Fernet key for credential encryption)
- Upstream source definition via env vars:
- ORCHARD_UPSTREAM__{NAME}__URL (double underscore as separator)
- ORCHARD_UPSTREAM__{NAME}__TYPE (npm, pypi, maven, etc.)
- ORCHARD_UPSTREAM__{NAME}__ENABLED (true/false)
- ORCHARD_UPSTREAM__{NAME}__IS_PUBLIC (true/false)
- ORCHARD_UPSTREAM__{NAME}__AUTH_TYPE (none, basic, bearer, api_key)
- ORCHARD_UPSTREAM__{NAME}__USERNAME
- ORCHARD_UPSTREAM__{NAME}__PASSWORD
- ORCHARD_UPSTREAM__{NAME}__PRIORITY
- Example: ORCHARD_UPSTREAM__NPM_PRIVATE__URL=https://npm.corp.com
- Env var sources:
- Loaded at startup
- Merged with database sources
- Env var sources have source = "env" marker
- Cannot be modified via API (return 400)
- Cannot be deleted via API (return 400)
- Update Settings class in config.py
- Update get/list endpoints to include env-defined sources
- Document all env vars in CLAUDE.md
- Unit tests for env var parsing
- Integration tests with env vars set
Technical Notes
- Double underscore (__) separator allows source names with single underscores
- Env-defined sources should appear in API responses but marked as read-only
- Consider startup validation that warns about invalid env var combinations
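Parsing the ORCHARD_UPSTREAM__{NAME}__{FIELD} variables could look something like the sketch below. The function names are made up, the source name is kept exactly as it appears in the variable (how it maps to a display name is not specified here), and rejecting any boolean other than true/false is an assumption.

```python
from typing import Dict, Mapping

PREFIX = "ORCHARD_UPSTREAM__"
BOOL_FIELDS = {"enabled", "is_public"}


def parse_bool(value: str) -> bool:
    lowered = value.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"expected true/false, got {value!r}")
    return lowered == "true"


def parse_upstream_env(environ: Mapping[str, str]) -> Dict[str, Dict[str, object]]:
    """Group ORCHARD_UPSTREAM__* variables into one dict per source name."""
    sources: Dict[str, Dict[str, object]] = {}
    for key, raw in environ.items():
        if not key.startswith(PREFIX):
            continue
        # Split on the LAST "__" so field names such as IS_PUBLIC and source
        # names with single underscores (NPM_PRIVATE) both survive.
        name, sep, field = key[len(PREFIX):].rpartition("__")
        if not (sep and name and field):
            continue
        field = field.lower()
        value: object = raw
        if field in BOOL_FIELDS:
            value = parse_bool(raw)
        elif field == "priority":
            value = int(raw)
        # Every env-defined source carries the read-only marker.
        sources.setdefault(name, {"source": "env"})[field] = value
    return sources
```

Passing the environment in as a mapping (for example `parse_upstream_env(os.environ)`) keeps the parser easy to unit test with a plain dictionary.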
Issue #75: Frontend - Upstream Sources Management
Status: Pending
Description
Admin UI for managing upstream sources and cache settings.
Acceptance Criteria
- New admin page: /admin/cache or /admin/upstream-sources
- Upstream sources section:
- Table listing all sources with: name, type, URL, enabled toggle, public badge, priority, actions
- Visual distinction for env-defined sources (locked icon, no edit/delete)
- Create button opens modal/form
- Edit button for DB-defined sources
- Delete with confirmation modal
- Test connection button with status indicator
- Create/edit form fields:
- Name (text, required)
- Source type (dropdown)
- URL (text, required)
- Priority (number)
- Is public (checkbox)
- Enabled (checkbox)
- Auth type (dropdown: none, basic, bearer, api_key)
- Conditional auth fields based on type:
- Basic: username, password
- Bearer: token
- API key: header name, header value
- Password fields masked, "unchanged" placeholder on edit
- Cache settings section:
- Air-gap mode toggle with warning
- Auto-create system projects toggle
- "Air-gap mode" shows prominent warning banner when enabled
- Link from main admin navigation
- Loading and error states
- Success/error toast notifications
Technical Notes
- Use existing admin page patterns from user management
- Air-gap toggle should require confirmation (modal with warning text)
Issue #105: Frontend - System Projects Integration
Status: Pending
Description
Integrate system projects into the frontend UI with appropriate visual treatment and navigation.
Acceptance Criteria
- Home page project dropdown:
- System projects shown in separate "Cached Packages" section
- Visual distinction (icon, different background, or badge)
- Format icon for each type (npm, pypi, maven, etc.)
- Project list/grid:
- System projects can be filtered: "Show system projects" toggle
- Or separate tab: "Projects" | "Package Cache"
- System project page:
- "System Cache" badge in header
- Description explains this is auto-managed cache
- Settings/delete buttons hidden or disabled
- Shows format type prominently
- Package page within system project:
- Shows "Cached from" with source URL (linked)
- Shows "First cached" timestamp
- Shows which upstream source provided it
- Artifact page:
- If artifact came from cache, show provenance:
- Original URL
- Upstream source name
- Fetch timestamp
- Search includes system projects (with filter option)
Technical Notes
- Use React context or query params for system project filtering
- Consider dedicated route: /cache/npm/lodash as alias for /_npm/lodash
Issue #77: CLI - Cache Command
Status: Pending
Description
Add a new orchard cache command to the existing CLI for caching artifacts from upstream URLs. This integrates with the new cache API endpoint and can optionally update orchard.ensure with cached artifacts.
Acceptance Criteria
- New command: orchard cache <url> in orchard/commands/cache.py
- Basic usage:
# Cache a URL, print artifact info
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz
# Output:
# Caching https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz...
# Source type: npm
# Package: lodash
# Version: 4.17.21
#
# Successfully cached artifact
# Artifact ID: abc123...
# Size: 1.2 MB
# System project: _npm
# System package: lodash
# System tag: 4.17.21
- Options:
| Option | Description |
|---|---|
| --type, -t TYPE | Source type: npm, pypi, maven, docker, helm, generic (auto-detected from URL if not provided) |
| --package, -p NAME | Package name in system project (auto-derived from URL if not provided) |
| --tag TAG | Tag name in system project (auto-derived from URL if not provided) |
| --project PROJECT | Also create tag in this user project |
| --user-package PKG | Package name in user project (required if --project specified) |
| --user-tag TAG | Tag name in user project (default: same as system tag) |
| --expected-hash HASH | Verify downloaded content matches this SHA256 |
| --add | Add to orchard.ensure after caching |
| --add-path PATH | Extraction path for --add (default: <package>/) |
| --file, -f FILE | Path to orchard.ensure file |
| --verbose, -v | Show detailed output |
- URL type auto-detection:
- registry.npmjs.org → npm
- pypi.org or files.pythonhosted.org → pypi
- repo1.maven.org or contains /maven2/ → maven
- registry-1.docker.io or docker.io → docker
- Otherwise → generic
- Package/version extraction from URL patterns:
- npm: /{package}/-/{package}-{version}.tgz
- pypi: /packages/.../requests-{version}.tar.gz
- maven: /{group}/{artifact}/{version}/{artifact}-{version}.jar
- Add cache_artifact() function to orchard/api.py
- Integration with --add flag:
- Parse existing orchard.ensure
- Add new dependency entry pointing to cached artifact
- Use artifact_id (SHA256) for hermetic pinning
- Batch mode: orchard cache --file urls.txt
- One URL per line
- Lines starting with # are comments
- Report success/failure for each
- Exit codes:
- 0: Success (or already cached)
- 1: Fetch failed
- 2: Hash mismatch
- 3: Air-gap mode blocked request
- Error handling consistent with existing CLI patterns
- Unit tests in test/test_cache.py
- Update README.md with cache command documentation
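Batch mode, as described in the criteria above, could be sketched as two small functions: one that reads the URL list and one that attempts each URL and collects the outcome. The names `read_url_lines` and `cache_batch` are hypothetical, the `cache_one` callable stands in for the real API call, and skipping blank lines is an assumption (the criteria only mention # comments).

```python
from typing import Callable, Iterable, List, Tuple


def read_url_lines(lines: Iterable[str]) -> List[str]:
    """Return the URLs in a batch file, skipping comments and blank lines."""
    urls = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


def cache_batch(
    urls: Iterable[str], cache_one: Callable[[str], None]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Try every URL; return (succeeded, [(failed_url, reason), ...])."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            cache_one(url)
            succeeded.append(url)
        except Exception as exc:  # keep going so one bad URL does not stop the batch
            failed.append((url, str(exc)))
    return succeeded, failed
```

Collecting failures instead of aborting on the first one matches the "report success/failure for each" requirement. How a mixed result maps onto the exit codes is left open here.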
Technical Notes
- Follow existing Click patterns from other commands
- Use get_auth_headers() from orchard/auth.py
- URL parsing can use urllib.parse
- Consider adding URL pattern registry for extensibility
- The --add flag should integrate with existing ensure file parsing in orchard/ensure.py
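The URL type auto-detection rules from the acceptance criteria translate fairly directly into code using urllib.parse. The sketch below assumes exact hostname matches. The function name `detect_source_type` is hypothetical.

```python
from urllib.parse import urlparse


def detect_source_type(url: str) -> str:
    """Guess the source type from the URL, falling back to "generic"."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "registry.npmjs.org":
        return "npm"
    if host in ("pypi.org", "files.pythonhosted.org"):
        return "pypi"
    # The path check catches private mirrors that keep the /maven2/ layout.
    if host == "repo1.maven.org" or "/maven2/" in parsed.path:
        return "maven"
    if host in ("registry-1.docker.io", "docker.io"):
        return "docker"
    return "generic"
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids misclassifying a URL that merely mentions a registry name in its path or query string.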
Example Workflows
# Simple: cache a single URL
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz
# Cache and add to orchard.ensure for current project
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz \
--add --add-path libs/lodash/
# Cache with explicit metadata
orchard cache https://internal.corp/files/custom-lib.tar.gz \
--type generic \
--package custom-lib \
--tag v1.0.0
# Cache and cross-reference to user project
orchard cache https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz \
--project my-app \
--user-package npm-deps \
--user-tag lodash-4.17.21
# Batch cache from file
orchard cache --file deps-urls.txt
# Verify hash while caching
orchard cache https://example.com/file.tar.gz \
--expected-hash sha256:abc123...
Out of Scope (Future Enhancements)
- Automatic transitive dependency resolution (client's responsibility)
- Lockfile parsing (package-lock.json, requirements.txt) - stretch goal for CLI
- Cache eviction policies (we cache forever by design)
- Mirroring/sync between Orchard instances
- Format-specific metadata extraction (npm package.json parsing, etc.)
Success Criteria
- Can cache any URL and retrieve by SHA256 hash
- Cached artifacts persist indefinitely
- Air-gap mode blocks all public internet access
- Multiple upstream sources with different auth
- System projects organize cached packages by format
- CLI can cache URLs and update orchard.ensure
- Admin UI for upstream source management