MMujtaba

CourseOptions — Intelligent Course Discovery Engine

Web App · SaaS
Next.js · Node.js · Elasticsearch · PostgreSQL · AWS · Redis

  • 50K+ Courses Indexed
  • 200+ Providers
  • 500K+ Active Learners
  • <100ms Search Speed

Overview

CourseOptions is an intelligent course discovery engine that aggregates online learning content from 200+ providers — Coursera, Udemy, edX, LinkedIn Learning, and dozens more — into a single, searchable, personalized platform. Half a million learners use it to find the right course for their career goals without spending hours comparing platforms.

I joined as Senior Full-Stack Engineer and owned the ingestion pipeline, the Elasticsearch search layer, and the Next.js frontend from day one.

The Challenge

The data problem was formidable: 200+ providers with wildly inconsistent data schemas, update frequencies ranging from real-time to monthly, and course attributes that didn't map cleanly across platforms. A Coursera "specialization" and a Udemy "course" are structurally different products — our schema had to normalize them without losing fidelity.

On the search side, users expected Google-level relevance. A query for "machine learning for beginners" needed to surface beginner-appropriate courses first, account for course ratings and freshness, and de-rank duplicates from multiple providers. Elasticsearch's out-of-the-box BM25 scoring wasn't enough.

Architecture & Technical Decisions

Multi-Provider Ingestion Pipeline

I built a plugin-style ingestion architecture where each provider has a dedicated adapter implementing a standard interface. Adapters handle auth (OAuth, API keys, scraping where unavoidable), rate limiting, and schema normalization. A central orchestrator schedules jobs per provider based on their update cadence, using BullMQ for reliable execution and retry logic.

  • 200+ provider adapters with pluggable auth strategies
  • Canonical course schema with provider-specific metadata stored as JSONB
  • Deduplication via fingerprinting (title + provider + URL hash) before indexing
  • Dead-letter queue for failed ingestion jobs with Slack alerting
  • Full re-index triggered nightly; incremental sync every 4 hours for top providers
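The adapter contract and the pre-index fingerprinting can be sketched roughly as below. The interface and field names (`CanonicalCourse`, `fetchBatch`) are illustrative stand-ins, not the production schema:

```typescript
import { createHash } from "node:crypto";

// Canonical course shape every adapter normalizes into.
// Field names here are illustrative, not the production schema.
interface CanonicalCourse {
  title: string;
  provider: string;
  url: string;
  metadata: Record<string, unknown>; // provider-specific fields, stored as JSONB
}

// Standard interface each provider adapter implements; the orchestrator
// only ever talks to this contract, never to provider-specific code.
interface ProviderAdapter {
  name: string;
  fetchBatch(cursor?: string): Promise<{ courses: CanonicalCourse[]; next?: string }>;
}

// Deduplication fingerprint: hash of title + provider + URL,
// computed before indexing so duplicates never reach Elasticsearch.
function fingerprint(course: CanonicalCourse): string {
  return createHash("sha256")
    .update(`${course.title}|${course.provider}|${course.url}`)
    .digest("hex");
}
```

Keeping the dedup key outside the adapters means a new provider integration only has to get normalization right; identity is decided centrally.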

Elasticsearch Search & Relevance Tuning

The search layer used a custom scoring model built on top of Elasticsearch's BM25 baseline. I added function score queries that boosted results based on: average rating (weighted by review count), freshness (recency decay), provider reputation score, and enrollment velocity. Query-time boosting for beginner/advanced tags based on inferred user level from their history completed the relevance stack.

  • Multi-field search across title, description, instructor, and tags with per-field boosts
  • Function score query with Gaussian decay for recency and sigmoid for rating confidence
  • Synonym expansion ("ML" → "machine learning") via custom analyzer
  • Auto-complete via edge n-gram tokenization on title field
  • A/B tested scoring weights against click-through rate — improved CTR by 28%
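A simplified version of the relevance query looks roughly like this. The field names (`published_at`, `rating`, `review_count`), the decay scale, and all weights are placeholders, not the A/B-tested production values, and the review-count saturation term stands in for the sigmoid confidence curve:

```typescript
// Sketch of a function_score query over the BM25 baseline:
// multi-field match with per-field boosts, Gaussian recency decay,
// and rating weighted by review volume.
function buildCourseQuery(text: string) {
  return {
    query: {
      function_score: {
        query: {
          multi_match: {
            query: text,
            fields: ["title^3", "description", "instructor^2", "tags^2"],
          },
        },
        functions: [
          // Recency: Gaussian decay from "now" over the publish date.
          { gauss: { published_at: { origin: "now", scale: "180d", decay: 0.5 } } },
          // Rating confidence: rating scaled by a review-count saturation
          // term (placeholder midpoint of 50 reviews).
          {
            script_score: {
              script: {
                source:
                  "double n = doc['review_count'].value; " +
                  "return doc['rating'].value * (n / (n + 50.0))",
              },
            },
            weight: 1.5,
          },
        ],
        score_mode: "sum",
        boost_mode: "multiply",
      },
    },
  };
}
```

Keeping the query builder as a pure function made it straightforward to A/B test weight variants: each experiment arm was just a different set of constants fed into the same structure.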

Frontend Performance

The Next.js frontend used ISR (Incremental Static Regeneration) for course detail pages, pre-rendering the top 10K most-visited courses at build time and regenerating on a 1-hour stale window. Search results pages used streaming SSR to send the page shell immediately while the Elasticsearch query resolved. Core Web Vitals: LCP 0.9s, CLS 0, FID <50ms.
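The ISR setup for a course detail page can be sketched as below, assuming a Pages Router dynamic route like `pages/courses/[slug].tsx`. `getTopCourseSlugs` and `fetchCourse` are hypothetical helpers standing in for the real popularity query and data fetch:

```typescript
// Stub helpers for illustration only; the real implementations query
// the analytics store and the course API respectively.
async function getTopCourseSlugs(n: number): Promise<string[]> {
  return ["intro-to-ml"].slice(0, n);
}
async function fetchCourse(slug: string) {
  return { slug, title: "Example Course" };
}

// Pre-render the most-visited courses at build time; everything else
// is generated on first request ("blocking") and then cached.
export async function getStaticPaths() {
  const slugs = await getTopCourseSlugs(10_000); // top 10K most-visited
  return {
    paths: slugs.map((slug) => ({ params: { slug } })),
    fallback: "blocking",
  };
}

export async function getStaticProps({ params }: { params: { slug: string } }) {
  const course = await fetchCourse(params.slug);
  return {
    props: { course },
    revalidate: 3600, // regenerate at most once per hour (the stale window)
  };
}
```

The `fallback: "blocking"` choice trades a slower first hit on long-tail courses for never serving an empty shell, which kept CLS at zero on detail pages.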

Results

  • 50K+ courses indexed from 200+ providers with <0.1% schema normalization errors
  • Search p95 latency: 94ms including Elasticsearch query + Redis cache check
  • 500K+ monthly active learners with 4.1-minute average session duration
  • Recommendation click-through rate 3.2x higher than generic browse
  • Ingestion pipeline reliability: 99.7% job success rate over 12 months

What I Learned

Search relevance is a product problem before it's a technical problem. The best Elasticsearch configuration in the world won't save you if you don't understand what "good result" means to your users. Instrumenting every search interaction — what users clicked, what they refined, what they abandoned — and feeding that signal back into scoring weights was as valuable as any infrastructure decision.
