Executive Summary
LumaChat was not born from a perfect architecture. It was born from real constraints: a single VPS with 4 cores and 8 GB of RAM, a small team (2-3 developers), and the need to survive in production on a limited budget.
After 160+ production releases, I have learned that good architecture is not the most sophisticated architecture, but the one that survives in the real world. This article dissects the critical architecture decisions, the failures that occurred, and the lessons from real production deployments.
Scope:
- WebSocket vs SSE: the decision and its consequences
- Redis crash: 0.3% message loss and the recovery strategy
- Polling overhead: CPU spike at 3K users
- Scaling strategy: 1K → 100K users on a minimal budget
Architecture Evolution Journey
Version Timeline & Real Incidents
v1.0 (Nov 2025) - MVP Launch
├─ Architecture: Polling only (2s interval)
├─ Infrastructure: Single worker process, SQLite
├─ Capacity: ~500 concurrent users
└─ Incident: None (low traffic)
v2.0 (Dec 2025) - Production Scale
├─ Architecture: PM2 Cluster (4 workers), Redis pub/sub
├─ Infrastructure: PostgreSQL migration, MinIO storage
├─ Capacity: ~1,200 concurrent users tested
└─ 🔥 Incident: Christmas spike (25 Dec)
- 1,500 concurrent users
- CPU 95%, latency 800ms
- Fix: Increased polling interval
v2.5 (Jan 2026) - Reliability Improvements
├─ Architecture: Pending queue fallback, Redis limits
├─ Infrastructure: Connection pool tuning (max 20)
├─ Capacity: ~2,000 concurrent users
└─ 🔥 Incidents:
- Redis OOM crash (15 Jan): 0.3% message loss, 4 min downtime
- Certificate expiry (18 Jan): 2 hours downtime
- Fix: maxmemory config, monitoring alerts
v3.0 (Feb 2026) - Performance Optimization
├─ Architecture: Polling 2s → 5s, structured logging
├─ Infrastructure: Health checks, metrics collection
├─ Capacity: ~3,000 concurrent users
└─ 🔥 Incident: Polling overhead (22 Feb)
- CPU 45% → 78% at 3K users
- Latency 200ms → 3.2s
- Fix: Increased interval, identified bottleneck
v4.0 (Q2 2026) - Planned Evolution
├─ Architecture: PostgreSQL LISTEN/NOTIFY
├─ Infrastructure: DB read replicas, Redis cluster
├─ Capacity Target: 10K concurrent users
└─ Goals: Event-driven, 90% reduction in polling queries
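The v2.5 fix for the Redis OOM crash came down to bounding Redis memory and choosing an eviction policy. A minimal redis.conf sketch of that kind of fix; the specific limits here are illustrative, not LumaChat's actual settings:

```
# Cap Redis memory so a backlog cannot OOM the host
maxmemory 512mb
# Evict least-recently-used keys under pressure instead of crashing
maxmemory-policy allkeys-lru
# Bound per-subscriber output buffers (hard 32mb; soft 8mb sustained 60s),
# since pub/sub backlog lives in client buffers rather than evictable keys
client-output-buffer-limit pubsub 32mb 8mb 60
```

Pairing the `maxmemory` cap with monitoring alerts (as in the v2.5 fix) matters: eviction keeps the process alive, but it silently sheds data, so you still want to know when it starts happening.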
Key Insight: Every version was a response to real production pain, not theoretical optimization.
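Both polling incidents (the Christmas 2025 spike and the February 2026 overhead) were mitigated the same way: widening the polling interval under load. One way to make that fix operational is to have the server advertise a back-off interval with each response; a sketch under that assumption, with illustrative thresholds (the only figure from the source is the 2s → 5s move in v3.0):

```javascript
// Map current load to the polling interval the server advertises to clients.
// Thresholds and intervals are illustrative; LumaChat moved 2s -> 5s in v3.0.
function pollingIntervalMs(concurrentUsers) {
  if (concurrentUsers < 1000) return 2000;  // baseline: 2s
  if (concurrentUsers < 2500) return 5000;  // moderate load: 5s
  return 10000;                             // heavy load: back off to 10s
}

// A client blends the server hint with its current delay, so the fleet
// converges on the new interval gradually instead of all jumping at once.
function nextPollDelay(serverHintMs, lastDelayMs) {
  return Math.round((serverHintMs + lastDelayMs) / 2);
}
```

Advertising the interval in responses means the whole fleet backs off without a redeploy, which is exactly what an incident-time fix needs.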
Incident Summary (2025-2026):
- Total Incidents: 6
- Total Downtime: 6h 34min
- Availability: 99.92% (target: 99.9%)
- Message Loss: 0.3% (single incident)
- Lessons Learned: 6 major architectural improvements
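The availability figure can be sanity-checked from the downtime total. A quick check, assuming an observation window of roughly 8,000 hours (about 11 months, which is the window length that makes 6h 34min of downtime come out at 99.92%):

```javascript
// Availability = (window - downtime) / window, as a percentage.
const windowHours = 8000;            // assumed ~11-month observation window
const downtimeHours = 6 + 34 / 60;   // 6h 34min total downtime
const availability = ((windowHours - downtimeHours) / windowHours) * 100;
console.log(availability.toFixed(2) + '%');  // → 99.92%
```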
Conclusion
Most chat-architecture articles show perfect diagrams with "infinite scalability". This article shows real constraints, real failures, and real evolution.
Architecture Decisions:
- WebSocket + WebRTC: separation of concerns between messaging and calls
- PM2 Cluster + Redis: horizontal scaling without Kubernetes complexity
- PostgreSQL: ACID transactions are non-negotiable for message ordering
- Polling → event-driven: learned at 3K users, planned for v4.0
- Incident-driven evolution: every failure produced an architectural improvement
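The polling-to-event-driven migration planned for v4.0 hinges on PostgreSQL LISTEN/NOTIFY. A minimal sketch of the idea, assuming a hypothetical `new_message` channel and a pointer-style payload; the subscription wiring (which would use node-postgres against a live database) is shown as comments:

```javascript
// Build the parameterized NOTIFY statement for a new message.
// NOTIFY payloads are limited to 8000 bytes in PostgreSQL, so we send
// only a pointer (id + conversation) and let workers fetch the full row.
function buildNotify(channel, message) {
  const payload = JSON.stringify({
    id: message.id,
    conversationId: message.conversationId,
  });
  if (Buffer.byteLength(payload) > 8000) {
    throw new Error('NOTIFY payload exceeds the PostgreSQL 8000-byte limit');
  }
  return { text: 'SELECT pg_notify($1, $2)', values: [channel, payload] };
}

// Subscription side, sketched with node-postgres (needs a live database):
//   const { Client } = require('pg');
//   const client = new Client();
//   await client.connect();
//   await client.query('LISTEN new_message');
//   client.on('notification', (msg) => {
//     // fan out to this worker's local WebSocket clients
//   });
```

Because every worker LISTENs directly on the database, the recurring poll queries disappear, which is where the projected 90% query reduction would come from.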
Implementation Principles:
- Start simple: a single server + PM2 is sufficient for 1K-10K users
- Battle-tested over novel: WebSocket, PostgreSQL, Signal Protocol
- Design for failure: reconnection, pending queue, fallback mechanisms
- Observability first: structured logging, metrics, health checks
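The "design for failure" principle can be sketched concretely: an exponential-backoff reconnect schedule plus a bounded pending queue that buffers messages while the socket is down and flushes them on reconnect. Class and function names here are illustrative, not LumaChat's actual API:

```javascript
// Bounded buffer for messages composed while the connection is down.
class PendingQueue {
  constructor(limit = 1000) {
    this.limit = limit;   // bound the queue so a long outage cannot OOM the client
    this.items = [];
  }
  push(msg) {
    if (this.items.length >= this.limit) this.items.shift(); // drop oldest
    this.items.push(msg);
  }
  flush(send) {
    while (this.items.length > 0) send(this.items.shift());
  }
}

// Exponential backoff with a cap: 1s, 2s, 4s, ... up to 30s.
function reconnectDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

In practice a little random jitter is usually added to the backoff so thousands of clients do not reconnect in lockstep after an outage.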
Reflection:
"Incidents turn assumptions into data. Architecture evolves not through theory, but through measured failure.
Redis OOM: 0.3% message loss → pending queue implementation.
Polling overhead: 78% CPU at 3K users → event-driven migration planned.
Certificate expiry: 2 hours of downtime → monitoring automation.
Every production failure produced one architectural improvement. This system is the result of 6 incidents, 160+ releases, and 99.92% uptime maintained through data-driven iteration."
Contact: emylton@leunufna.xyz
Source: LumaChat v3.1.99 (Build 1989) - 160+ production releases
Appendix: Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Backend | Node.js 18 + Express 4.18 | Event-driven REST API |
| | ws 8.16 | WebSocket server |
| | PostgreSQL 14+ | Primary data store |
| | Redis 7+ | Pub/sub + caching |
| | MinIO | S3-compatible storage |
| | PM2 | Cluster management |
| Frontend | Flutter 3.8.1+ | Cross-platform UI |
| | Riverpod 3.2 | State management |
| | Isar Community 3.3 | Local database |
| | Signal Protocol | E2EE (X25519 + AES-GCM) |
| | flutter_webrtc 1.2 | P2P calls |