Technical Documentation

System Architecture & Engineering Decisions

A fully automated, cloud-native ETL pipeline and web application built to securely ingest, process, and serve historical sports data. This document outlines the architectural choices, tradeoffs, and DevOps practices used to build a resilient, production-grade system on Google Cloud Platform (GCP).

1. System Architecture

The platform is designed around a decoupled, pull-based architecture separating heavy data ingestion from the lightweight API serving layer.

Data Ingestion: Automated Python scraping scripts use TLS fingerprinting techniques to bypass bot detection and extract raw JSON payloads.
Data Lake (GCS): Raw artifacts are cached in Google Cloud Storage utilizing a deterministic, chronological path structure to ensure replayability.
Relational Store (Cloud SQL): Data is normalized and upserted into PostgreSQL using ON CONFLICT clauses backed by unique indexes to guarantee idempotency.
Serving Layer (Cloud Run): A containerized FastAPI microservice queries the database and dynamically renders the UI using Jinja2 templates and Bootstrap 5.
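
The deterministic, chronological path structure used for the data lake can be illustrated with a minimal Python sketch. The layout `raw/<source>/YYYY/MM/DD/<game_id>.json` and the names `raw_object_path`, `source`, and `game_id` are assumptions made for illustration; the pipeline's actual layout may differ.

```python
from datetime import date


def raw_object_path(source: str, game_date: date, game_id: str) -> str:
    """Build a GCS object name from the payload's own attributes.

    Hypothetical layout: raw/<source>/YYYY/MM/DD/<game_id>.json
    The result depends only on the inputs, never on when the job runs,
    so re-running a job for the same game targets the same object.
    """
    return f"raw/{source}/{game_date:%Y/%m/%d}/{game_id}.json"


if __name__ == "__main__":
    # Illustrative values only.
    print(raw_object_path("example-league", date(2024, 1, 15), "game-0001"))
```

Because the object name is a pure function of the game's attributes, a retry overwrites the earlier artifact instead of creating a second copy, and a date prefix can be listed to replay a given day.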

2. Core Engineering Decisions & Tradeoffs

Engineering is about choosing the right compromises. Below are the key decisions made during the design of this system:

Compute: Serverless vs. Orchestration

The Decision: Chose fully managed Cloud Run over Kubernetes (GKE).
The Tradeoff: Traded the granular cluster control and advanced networking meshes of Kubernetes for zero-maintenance auto-scaling and scale-to-zero cost efficiency. For a stateless API, Cloud Run drastically reduces operational overhead while maintaining container portability.

Security: Keyless CI/CD vs. Service Account Keys

The Decision: Implemented Workload Identity Federation (WIF) for GitHub Actions deployments.
The Tradeoff: Traded the initial setup complexity of OIDC trust pools for a strict Zero-Trust architecture. This completely eliminates the severe security risks associated with downloading, storing, and rotating long-lived JSON service account keys.

Release Engineering: Blue/Green Deployments vs. In-Place Updates

The Decision: Implemented Blue/Green traffic splitting in Cloud Run via GitHub Actions.
The Tradeoff: Traded the simplicity of a basic "push-to-latest" deployment for a slightly more complex CI/CD pipeline. New revisions are deployed with 0% public traffic and a dev tag, so new code can be validated in production on a private URL. Once verified, traffic is shifted to the new revision without downtime for end users.
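
The two-step release flow can be sketched as a pair of gcloud invocations, assembled here in Python so the arguments are easy to inspect. This is a sketch under stated assumptions, not the project's actual workflow: the service name, image, and region are placeholders, and the source does not state which exact commands the GitHub Actions job runs.

```python
import shlex


def deploy_candidate_cmd(service: str, image: str, region: str, tag: str = "dev") -> list[str]:
    """Deploy a new revision that receives no public traffic.

    --no-traffic keeps the current revision serving all requests;
    --tag gives the new revision its own tagged URL for validation.
    """
    return [
        "gcloud", "run", "deploy", service,
        "--image", image,
        "--region", region,
        "--no-traffic",
        "--tag", tag,
    ]


def promote_cmd(service: str, region: str, tag: str = "dev") -> list[str]:
    """Shift 100% of traffic to the revision carrying the given tag."""
    return [
        "gcloud", "run", "services", "update-traffic", service,
        "--region", region,
        "--to-tags", f"{tag}=100",
    ]


if __name__ == "__main__":
    # Placeholder values; substitute real project settings.
    print(shlex.join(deploy_candidate_cmd("example-api", "example-image:latest", "us-central1")))
    print(shlex.join(promote_cmd("example-api", "us-central1")))
```

Keeping promotion as a separate command means a failed validation leaves production untouched: the tagged revision stays at 0% traffic.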

Data Integrity: Write-Time Complexity vs. Read-Time Latency

The Decision: Enforced idempotent data pipelines via dynamic GCS pathing and PostgreSQL ON CONFLICT upserts.
The Tradeoff: Shifted heavy processing and complexity to the write path (the ETL pipeline). Ingestion takes longer, but overlapping cron jobs or manual retries cannot corrupt state or duplicate data, and API reads stay fast because they query rows that are already normalized.
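
The effect of an ON CONFLICT upsert can be demonstrated with a small self-contained sketch. To keep it runnable without a database server, it uses Python's built-in sqlite3 module rather than PostgreSQL; SQLite accepts the same `ON CONFLICT (...) DO UPDATE` form for this statement. The `games` table and its columns are hypothetical and not taken from the project's schema.

```python
import sqlite3

# Hypothetical table and columns, for illustration only.
UPSERT = """
INSERT INTO games (game_id, home_score, away_score)
VALUES (?, ?, ?)
ON CONFLICT (game_id) DO UPDATE SET
    home_score = excluded.home_score,
    away_score = excluded.away_score
"""


def load(conn: sqlite3.Connection, rows: list[tuple[str, int, int]]) -> None:
    """Insert new rows and update existing ones in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)


def demo() -> list[tuple[str, int, int]]:
    conn = sqlite3.connect(":memory:")
    # The primary key provides the unique index that ON CONFLICT relies on.
    conn.execute(
        "CREATE TABLE games ("
        "game_id TEXT PRIMARY KEY, home_score INTEGER, away_score INTEGER)"
    )
    batch = [("g1", 101, 99), ("g2", 88, 92)]
    load(conn, batch)
    load(conn, batch)              # a retry or overlapping run: no duplicates
    load(conn, [("g1", 102, 99)])  # a corrected payload updates in place
    return conn.execute(
        "SELECT game_id, home_score, away_score FROM games ORDER BY game_id"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

Running the same batch twice leaves two rows, not four, and the corrected score replaces the earlier value. With a PostgreSQL driver the placeholder style would differ (for example `%s` in psycopg), but the ON CONFLICT clause keeps the same shape.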

3. Infrastructure as Code (IaC)

The foundation of this project is completely codified using Terraform. By strictly managing state and resources via code, the project achieves:

Disaster Recovery: The ability to completely destroy and confidently recreate the entire GCP architecture in minutes.
Cost Management: Facilitates rapid tear-downs of expensive resources during development lulls without losing configuration integrity.
Lifecycle Management: Terraform provisions the baseline infrastructure but is explicitly configured to ignore Cloud Run traffic changes, delegating release management entirely to the GitHub Actions CI/CD pipeline.

4. Observability & Reliability

Modern infrastructure requires deep visibility. To move beyond basic monitoring, the application is instrumented to provide actionable insight during traffic spikes.

The LGTM Stack: The Python application uses OpenTelemetry to generate distributed traces, custom metrics, and application logs. These are routed to a local LGTM-style stack (Loki, Grafana, Tempo, and Prometheus, which takes the place of Mimir as the metrics backend).
Chaos & Load Testing: The API's resilience was validated using k6 to simulate concurrent user spikes. Distributed tracing (Tempo waterfalls) was leveraged to identify and isolate PostgreSQL query latency under heavy load, ensuring the database connection pool remained stable.

5. Future Roadmap

Cloud Telemetry Leap: Migrating the local OpenTelemetry Collector to export telemetry directly to Google Cloud Monitoring and Cloud Trace via a secondary configuration file.
Data Expansion & Machine Learning: Scaling the ingestion pipeline to support additional sports leagues and integrating predictive ML models to forecast game outcomes and player statistics.