From 9cadfa3b1bb21085545fd5fed414ff340f5d213d Mon Sep 17 00:00:00 2001
From: Mondo Diaz
Date: Wed, 4 Feb 2026 08:56:40 -0600
Subject: [PATCH] Add PyPI proxy performance & multi-protocol architecture design

Comprehensive design for:
- HTTP connection pooling with lifecycle management
- Redis caching layer (TTL for discovery, permanent for immutable)
- Abstract PackageProxyBase for multi-protocol support (npm, Maven)
- Database query optimization with batch operations
- Dependency resolution caching for ensure files
- Observability via health endpoints

Maintains hermetic build guarantees: artifact content and extracted
metadata are immutable, only discovery data uses TTL-based caching.
---
 ...026-02-04-pypi-proxy-performance-design.md | 228 ++++++++++++++++++
 1 file changed, 228 insertions(+)
 create mode 100644 docs/plans/2026-02-04-pypi-proxy-performance-design.md

diff --git a/docs/plans/2026-02-04-pypi-proxy-performance-design.md b/docs/plans/2026-02-04-pypi-proxy-performance-design.md
new file mode 100644
index 0000000..da1b226
--- /dev/null
+++ b/docs/plans/2026-02-04-pypi-proxy-performance-design.md
@@ -0,0 +1,228 @@
+# PyPI Proxy Performance & Multi-Protocol Architecture Design
+
+**Date:** 2026-02-04
+**Status:** Approved
+**Branch:** fix/pypi-proxy-timeout
+
+## Overview
+
+Comprehensive infrastructure overhaul to address latency, throughput, and resource consumption issues in the PyPI proxy, while establishing a foundation for npm, Maven, and other package protocols.
+
+## Goals
+
+1. **Reduce latency** - Eliminate per-request connection overhead, cache aggressively
+2. **Increase throughput** - Handle hundreds of concurrent requests without degradation
+3. **Lower resource usage** - Connection pooling, efficient DB queries, proper async I/O
+4. **Enable multi-protocol** - Abstract base class ready for npm/Maven/etc.
+5. **Maintain hermetic builds** - Immutable artifact content and metadata, mutable discovery data
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                         FastAPI Application                         │
+│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
+│  │ PyPI Proxy  │  │  npm Proxy  │  │ Maven Proxy │  │  (future)   │ │
+│  │   Router    │  │   Router    │  │   Router    │  │             │ │
+│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └─────────────┘ │
+│         │                │                │                         │
+│         └────────────────┼────────────────┘                         │
+│                          ▼                                          │
+│              ┌───────────────────────┐                              │
+│              │   PackageProxyBase    │  ← Abstract base class       │
+│              │   - check_cache()     │                              │
+│              │   - fetch_upstream()  │                              │
+│              │   - store_artifact()  │                              │
+│              │   - serve_artifact()  │                              │
+│              └───────────┬───────────┘                              │
+│                          │                                          │
+│         ┌────────────────┼────────────────┐                         │
+│         ▼                ▼                ▼                         │
+│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │
+│  │ HttpClient  │  │ CacheService│  │ ThreadPool  │                  │
+│  │   Manager   │  │   (Redis)   │  │  Executor   │                  │
+│  └─────────────┘  └─────────────┘  └─────────────┘                  │
+│         │                │                │                         │
+└─────────┼────────────────┼────────────────┼──────────────────────────┘
+          ▼                ▼                ▼
+     ┌──────────┐     ┌──────────┐   ┌──────────────┐
+     │ Upstream │     │  Redis   │   │   S3/MinIO   │
+     │ Sources  │     │          │   │              │
+     └──────────┘     └──────────┘   └──────────────┘
+```
+
+## Components
+
+### 1. HttpClientManager
+
+Manages httpx.AsyncClient pools with FastAPI lifespan integration.
+
+**Features:**
+- Default pool for general requests
+- Per-upstream pools for sources needing specific config/auth
+- Graceful shutdown drains in-flight requests
+- Dedicated thread pool for blocking operations
+
+**Configuration:**
+```bash
+ORCHARD_HTTP_MAX_CONNECTIONS=100 # Default pool size
+ORCHARD_HTTP_KEEPALIVE_CONNECTIONS=20 # Keep-alive connections
+ORCHARD_HTTP_CONNECT_TIMEOUT=30 # Connection timeout (seconds)
+ORCHARD_HTTP_READ_TIMEOUT=60 # Read timeout (seconds)
+ORCHARD_HTTP_WORKER_THREADS=32 # Thread pool size
+```
+
+**File:** `backend/app/http_client.py`
+
+### 2. CacheService (Redis Layer)
+
+Redis-backed caching with category-aware TTL and invalidation.
+
+**Cache Categories:**
+
+| Category | TTL | Invalidation | Purpose |
+|----------|-----|--------------|---------|
+| ARTIFACT_METADATA | Forever | Never (immutable) | Artifact info by SHA256 |
+| ARTIFACT_DEPENDENCIES | Forever | Never (immutable) | Extracted deps by SHA256 |
+| DEPENDENCY_RESOLUTION | Forever | Manual/refresh param | Resolution results |
+| UPSTREAM_SOURCES | 1 hour | On DB change | Upstream config |
+| PACKAGE_INDEX | 5 min | TTL only | PyPI/npm index pages |
+| PACKAGE_VERSIONS | 5 min | TTL only | Version listings |
+
+**Key format:** `orchard:{category}:{protocol}:{identifier}`
+
+**Configuration:**
+```bash
+ORCHARD_REDIS_HOST=redis
+ORCHARD_REDIS_PORT=6379
+ORCHARD_REDIS_DB=0
+ORCHARD_CACHE_TTL_INDEX=300 # Package index: 5 minutes
+ORCHARD_CACHE_TTL_VERSIONS=300 # Version listings: 5 minutes
+ORCHARD_CACHE_TTL_UPSTREAM=3600 # Upstream config: 1 hour
+```
+
+**File:** `backend/app/cache_service.py`
+
+### 3. PackageProxyBase
+
+Abstract base class defining the cache→fetch→store→serve flow.
+
+**Abstract methods (protocol-specific):**
+- `get_protocol_name()` - Return 'pypi', 'npm', 'maven'
+- `get_system_project_name()` - Return '_pypi', '_npm'
+- `rewrite_index_html()` - Rewrite upstream index to Orchard URLs
+- `extract_metadata()` - Extract deps from package file
+- `parse_package_url()` - Parse URL into package/version/filename
+
+**Concrete methods (shared):**
+- `serve_index()` - Serve package index with caching
+- `serve_artifact()` - Full cache→fetch→store→serve flow
+
+**File:** `backend/app/proxy_base.py`
+
+### 4. ArtifactRepository (DB Optimization)
+
+Optimized database operations eliminating N+1 queries.
+
+**Key methods:**
+- `get_or_create_artifact()` - Atomic upsert via ON CONFLICT
+- `batch_upsert_dependencies()` - Single INSERT for all deps
+- `get_cached_url_with_artifact()` - Joined query for cache lookup
+
+**Query reduction:**
+
+| Operation | Before | After |
+|-----------|--------|-------|
+| Cache hit check | 2 queries | 1 query (joined) |
+| Store artifact | 3-4 queries | 1 query (upsert) |
+| Store 50 deps | 50+ queries | 1 query (batch) |
+
+**Configuration:**
+```bash
+ORCHARD_DATABASE_POOL_SIZE=20 # Base connections (up from 5)
+ORCHARD_DATABASE_MAX_OVERFLOW=30 # Burst capacity (up from 10)
+ORCHARD_DATABASE_POOL_TIMEOUT=30 # Wait timeout
+ORCHARD_DATABASE_POOL_PRE_PING=false # Disable in prod for performance
+```
+
+**File:** `backend/app/db_utils.py`
+
+### 5. Dependency Resolution Caching
+
+Cache resolution results for ensure files and API queries.
+
+**Cache key:** Hash of (artifact_id, max_depth, include_optional)
+
+**Invalidation:** Manual only (immutable artifact deps mean cached resolutions stay valid)
+
+**Refresh:** `?refresh=true` parameter forces fresh resolution
+
+**File:** Updates to `backend/app/dependencies.py`
+
+### 6. FastAPI Integration
+
+Lifespan-managed infrastructure with dependency injection.
+
+**Startup:**
+1. Initialize HttpClientManager (connection pools)
+2. Initialize CacheService (Redis connection)
+3. Load upstream source configs
+
+**Shutdown:**
+1. Drain in-flight HTTP requests
+2. Close Redis connections
+3. Shutdown thread pool
+
+**Health endpoint additions:**
+- Database connection status
+- Redis ping
+- HTTP pool active/max connections
+- Thread pool active/max workers
+
+**File:** Updates to `backend/app/main.py`
+
+## Files Summary
+
+**New files:**
+- `backend/app/http_client.py` - HttpClientManager
+- `backend/app/cache_service.py` - CacheService
+- `backend/app/proxy_base.py` - PackageProxyBase
+- `backend/app/db_utils.py` - ArtifactRepository
+
+**Modified files:**
+- `backend/app/config.py` - New settings
+- `backend/app/main.py` - Lifespan integration
+- `backend/app/pypi_proxy.py` - Refactor to use base class
+- `backend/app/dependencies.py` - Resolution caching
+- `backend/app/routes.py` - Health endpoint, DI
+
+## Hermetic Build Guarantees
+
+**Immutable (cached forever):**
+- Artifact content (by SHA256)
+- Extracted dependencies for a specific artifact
+- Dependency resolution results
+
+**Mutable (TTL + event invalidation):**
+- Package index listings
+- Version discovery
+- Upstream source configuration
+
+Once an artifact is cached with SHA256 `abc123` and dependencies extracted, that data never changes.
+
+## Performance Expectations
+
+| Metric | Before | After |
+|--------|--------|-------|
+| HTTP connection setup | Per request (~100-500ms) | Pooled (~5ms) |
+| Cache hit (index page) | N/A | ~5ms (Redis) |
+| Store 50 dependencies | ~500ms (50 queries) | ~10ms (1 query) |
+| Dependency resolution (cached) | N/A | ~5ms |
+| Concurrent request capacity | ~15 (DB pool) | ~50 (configurable) |
+
+## Testing Requirements
+
+- Unit tests for each new component
+- Integration tests for full proxy flow
+- Load tests to verify pool sizing
+- Cache hit/miss verification tests
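
The CacheService categories, the `orchard:{category}:{protocol}:{identifier}` key format, and the dependency-resolution cache key from the design above can be sketched together in Python. This is only an illustration under stated assumptions, not the contents of `backend/app/cache_service.py`: the names `CacheCategory`, `DEFAULT_TTLS`, `build_key`, and `resolution_cache_id` are hypothetical, as are the lowercase key segments and the choice of SHA-256 over a JSON encoding for the hash.

```python
import hashlib
import json
from enum import Enum
from typing import Dict, Optional


class CacheCategory(Enum):
    """Categories from the design's cache table.

    The lowercase values used as key segments are an assumption.
    """
    ARTIFACT_METADATA = "artifact_metadata"
    ARTIFACT_DEPENDENCIES = "artifact_dependencies"
    DEPENDENCY_RESOLUTION = "dependency_resolution"
    UPSTREAM_SOURCES = "upstream_sources"
    PACKAGE_INDEX = "package_index"
    PACKAGE_VERSIONS = "package_versions"


# TTLs in seconds, mirroring the defaults in the design doc.
# None means "no expiry" (the "Forever" rows in the table).
DEFAULT_TTLS: Dict[CacheCategory, Optional[int]] = {
    CacheCategory.ARTIFACT_METADATA: None,
    CacheCategory.ARTIFACT_DEPENDENCIES: None,
    CacheCategory.DEPENDENCY_RESOLUTION: None,
    CacheCategory.UPSTREAM_SOURCES: 3600,  # ORCHARD_CACHE_TTL_UPSTREAM
    CacheCategory.PACKAGE_INDEX: 300,      # ORCHARD_CACHE_TTL_INDEX
    CacheCategory.PACKAGE_VERSIONS: 300,   # ORCHARD_CACHE_TTL_VERSIONS
}


def build_key(category: CacheCategory, protocol: str, identifier: str) -> str:
    """Build a key in the orchard:{category}:{protocol}:{identifier} format."""
    return f"orchard:{category.value}:{protocol}:{identifier}"


def resolution_cache_id(artifact_id: str, max_depth: int,
                        include_optional: bool) -> str:
    """Hash (artifact_id, max_depth, include_optional) into a stable identifier.

    SHA-256 over a compact JSON list is an assumed encoding; the design only
    says the key is a hash of these three values.
    """
    payload = json.dumps([artifact_id, max_depth, include_optional],
                         separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    key = build_key(CacheCategory.PACKAGE_INDEX, "pypi", "requests")
    print(key, DEFAULT_TTLS[CacheCategory.PACKAGE_INDEX])
    rid = resolution_cache_id("abc123", max_depth=5, include_optional=False)
    print(build_key(CacheCategory.DEPENDENCY_RESOLUTION, "pypi", rid))
```

Keeping the TTL in a per-category table means a caller never passes an expiry by hand, so an immutable category cannot be given a TTL by accident.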