
Observability

GreekManage's observability story today is container logs + audit logs. There's no APM, no distributed tracing, no metrics dashboard yet. This page documents what exists and what's planned.

Current state

| Concern | Tool | Where |
|---|---|---|
| Application logs | Container stdout (Django + Celery) | kubectl logs <pod> |
| Database logs | Postgres stdout | kubectl logs postgres-... |
| Audit log | DB table + S3 archive | AuditLog model + S3 daily archive |
| Error tracking | None | |
| Metrics | None | |
| Tracing | None | |
| Uptime | None | |
| Security scans | OWASP ZAP | E2E + nightly |

Application logging

Django + Celery write to stdout in JSON format (configurable via LOGGING in settings.py). In production:

kubectl logs -n greekmanage deploy/backend --tail=200 -f
kubectl logs -n greekmanage deploy/celery --tail=200 -f

For aggregated viewing across pods, use your cluster's log aggregator (Loki, CloudWatch Logs, GCP Logging — depends on your platform).

Log levels

| Module | Level | Why |
|---|---|---|
| Django | INFO | Request lifecycle, ORM warnings |
| Celery | INFO | Task start / complete |
| apps.common.middleware.AuditMiddleware | INFO | Audit log writes |
| apps.ai_services | INFO | LLM calls (tokens, latency) |
| apps.authentication.encryption | WARN | Key rotation events |
| Third-party libs (urllib3, etc.) | WARN | Reduce noise |

Override per environment via LOG_LEVEL env var.
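
A minimal sketch of what the JSON LOGGING block can look like while honoring LOG_LEVEL; the formatter class and field names here are illustrative, not necessarily what settings.py actually uses:

# settings.py (sketch)
import json
import logging
import os

class JSONFormatter(logging.Formatter):
    """Emit one JSON object per log record (illustrative formatter)."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": JSONFormatter}},
    "handlers": {"console": {"class": "logging.StreamHandler", "formatter": "json"}},
    "root": {"handlers": ["console"], "level": LOG_LEVEL},
    "loggers": {
        "django": {"level": "INFO"},
        "celery": {"level": "INFO"},
        "apps.common.middleware": {"level": "INFO"},
        "urllib3": {"level": "WARNING"},
    },
}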

Audit log

The single source of truth for "who did what when":

  • Every sensitive write logged automatically by AuditMiddleware
  • Append-only at the DB level (app-tier user has no UPDATE/DELETE on the table; see the migration sketch after this list)
  • Daily archive to S3 (gzipped JSONL, partitioned by org / year / month / day)
  • Per-org retention (default 180 days in DB; longer in S3)
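
A sketch of how that append-only guarantee can be enforced from a Django migration. The role name, table name, and app label below are assumptions; substitute the real ones:

# migrations/00XX_auditlog_append_only.py (sketch)
from django.db import migrations

# Assumed names: greekmanage_app = app-tier DB role, audit_auditlog = audit table.
APPEND_ONLY_SQL = """
REVOKE UPDATE, DELETE ON TABLE audit_auditlog FROM greekmanage_app;
GRANT SELECT, INSERT ON TABLE audit_auditlog TO greekmanage_app;
"""

class Migration(migrations.Migration):
    dependencies = [("audit", "0001_initial")]  # assumed app label
    operations = [
        migrations.RunSQL(APPEND_ONLY_SQL, reverse_sql=migrations.RunSQL.noop),
    ]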

See also: Audit & encryption.

Health checks

Backend exposes /api/health/:

GET /api/health/
200 OK
{
  "status": "healthy",
  "db": "ok",
  "redis": "ok",
  "version": "0.57.0"
}

Returns 503 if any dependency is down. Used by:

  • Kubernetes readiness + liveness probes
  • External uptime monitors (when you wire one up)
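
A minimal sketch of what that view can look like. The Redis check assumes the default cache is Redis-backed, and RELEASE_VERSION as a setting is an assumption; the actual implementation may differ:

# views.py (sketch)
from django.conf import settings
from django.core.cache import cache
from django.db import connection
from django.http import JsonResponse

def health(request):
    """Report dependency status; return 503 if anything is down."""
    checks = {"db": "ok", "redis": "ok"}
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        checks["db"] = "down"
    try:
        cache.set("healthcheck", "ok", timeout=5)  # assumes Redis-backed default cache
    except Exception:
        checks["redis"] = "down"
    healthy = all(value == "ok" for value in checks.values())
    return JsonResponse(
        {
            "status": "healthy" if healthy else "unhealthy",
            **checks,
            "version": getattr(settings, "RELEASE_VERSION", "dev"),  # assumed setting
        },
        status=200 if healthy else 503,
    )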

What's missing (priority order)

1. Error tracking — Sentry

Why first: production errors are silently lost in container logs unless someone happens to be tailing them. Sentry de-dupes errors, groups them by stack trace, and alerts on first occurrence.

Setup:

pip install sentry-sdk

# settings.py
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.celery import CeleryIntegration

if SENTRY_DSN := env("SENTRY_DSN", default=None):
    sentry_sdk.init(
        dsn=SENTRY_DSN,
        integrations=[DjangoIntegration(), CeleryIntegration()],
        traces_sample_rate=0.1,
        send_default_pii=False,  # don't ship PII
        environment=env("DJANGO_ENV", default="development"),
        release=env("RELEASE_VERSION", default="dev"),
    )

Frontend equivalent for browser errors:

import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
  environment: import.meta.env.MODE,
});

2. Metrics — Prometheus + Grafana

Why second: capacity planning, performance regressions, SLO tracking.

Setup:

pip install django-prometheus celery-prometheus

# settings.py
INSTALLED_APPS += ["django_prometheus"]
MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    *MIDDLEWARE,
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

Endpoint: /metrics (restrict to cluster-internal scrape).

Useful initial metrics:

  • django_http_requests_total{method, view, status} — request rate, error rate
  • django_db_query_duration_seconds — slow queries
  • celery_task_runtime_seconds{task} — task latency
  • process_resident_memory_bytes — memory leaks

Grafana dashboards: import community dashboards for Django + Celery, customize per app.
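
django_prometheus provides the built-in metrics above; app-specific counters can be added with the underlying prometheus_client library. A sketch for the LLM calls that apps.ai_services already logs (metric names and the wrapper are illustrative, not existing code):

# apps/ai_services/metrics.py (sketch)
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter(
    "greekmanage_llm_calls_total",            # assumed metric name
    "LLM API calls by model and outcome",
    ["model", "status"],
)
LLM_LATENCY = Histogram(
    "greekmanage_llm_call_duration_seconds",  # assumed metric name
    "LLM API call latency",
    ["model"],
)

def instrumented_llm_call(model, prompt, client):
    """Wrap an LLM call with a latency histogram and an outcome counter."""
    with LLM_LATENCY.labels(model=model).time():
        try:
            response = client.complete(model=model, prompt=prompt)  # hypothetical client
            LLM_CALLS.labels(model=model, status="ok").inc()
            return response
        except Exception:
            LLM_CALLS.labels(model=model, status="error").inc()
            raise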

3. Distributed tracing — OpenTelemetry

Why third: trace a single request across backend → DB → Celery → external API.

Setup:

pip install opentelemetry-distro opentelemetry-instrumentation-django \
    opentelemetry-instrumentation-celery opentelemetry-instrumentation-psycopg2
opentelemetry-bootstrap --action=install

# Auto-instrumentation is configured via environment variables, not settings.py:
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# OTEL_SERVICE_NAME=greekmanage-backend

Backends: Tempo, Jaeger, Honeycomb, Datadog (your choice).
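
Auto-instrumentation covers the Django, Celery, and psycopg2 spans; custom spans around interesting units of work can be added with the OpenTelemetry API. A sketch (span and attribute names are assumptions):

# anywhere in application code (sketch)
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def summarize_minutes(text, client):
    """Wrap a hypothetical LLM call in a custom span."""
    with tracer.start_as_current_span("llm.summarize_minutes") as span:
        span.set_attribute("llm.input_chars", len(text))
        return client.summarize(text)  # hypothetical client call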

4. Uptime monitoring — Better Uptime / Pingdom / Healthchecks.io

External GET against /api/health/ every 60s. Alerts on 3 consecutive failures.

Cheap, easy first win — set up before any of the above.

5. Log aggregation — Loki / Grafana Cloud Logs

Central place to search across all pods. Set up at the cluster level:

  • Promtail → Loki (self-hosted)
  • Grafana Cloud Logs (managed)
  • AWS CloudWatch Logs (if on AWS)

Tag logs by pod + app for filtering.

6. Frontend RUM — Web Vitals

Capture Core Web Vitals (LCP, FID, CLS) via the web-vitals package and ship to Sentry / Datadog / your analytics:

import { onCLS, onFID, onLCP, onINP, onTTFB } from "web-vitals";

const send = (metric) => {
  navigator.sendBeacon("/api/web-vitals/", JSON.stringify(metric));
};

// onFID requires web-vitals v3; v4 removed it in favor of INP.
onCLS(send); onFID(send); onLCP(send); onINP(send); onTTFB(send);
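
The /api/web-vitals/ endpoint in that snippet does not exist yet; a minimal receiver could simply log each metric as structured JSON for later aggregation. A sketch (view name, URL wiring, and logger name are assumptions):

# views.py (sketch)
import json
import logging

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

logger = logging.getLogger("apps.web_vitals")  # assumed logger name

@csrf_exempt  # sendBeacon can't attach a CSRF token
@require_POST
def web_vitals(request):
    """Accept a web-vitals beacon and log it for aggregation elsewhere."""
    try:
        metric = json.loads(request.body)
    except ValueError:
        return HttpResponse(status=400)
    if not isinstance(metric, dict):
        return HttpResponse(status=400)
    logger.info("web-vital %s=%s", metric.get("name"), metric.get("value"))
    return HttpResponse(status=204)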

Cost considerations

| Tool | Cost (approx) | Notes |
|---|---|---|
| Sentry | Free up to 5K events/month, then $26+/mo | Most affordable error tracking |
| Better Uptime | Free for 10 monitors | Simplest uptime tool |
| Healthchecks.io | Free for 20 checks | Cron / job monitoring |
| Grafana Cloud | Free tier: 10K series, 50 GB logs | Easiest hosted Prometheus + Loki |
| Datadog | $15/host/mo + extras | Most expensive but most integrated |

For a typical org running GreekManage in production, Sentry + Better Uptime + self-hosted Prometheus is a reasonable starting baseline (~$30/mo).

Manual debug techniques (until tooling lands)

Tail logs across pods

kubectl logs -n greekmanage -l app=backend --tail=200 -f --max-log-requests=10

Inspect a slow query

Use EXPLAIN ANALYZE in psql:

kubectl exec -it -n greekmanage postgres-... -- psql -U greekmanage greekmanage
EXPLAIN ANALYZE
SELECT * FROM members_memberprofile mp
JOIN organizations_membership m ON mp.membership_id = m.id
WHERE m.chapter_id = 'uuid-here';

Check Celery queue depth

kubectl exec -it -n greekmanage redis-... -- redis-cli LLEN celery

A growing queue means workers are saturated → scale up celery deployment replicas.
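
The same check can be scripted from Python (e.g. in a Django shell). This sketch assumes the default Redis broker, the default "celery" queue, and an assumed import path for the Celery app:

# sketch: queued vs. active Celery tasks
import redis
from config.celery import app  # assumed import path for the Celery app

def queue_stats(broker_url="redis://redis:6379/0"):  # assumed broker URL
    """Compare tasks waiting in Redis against tasks actively running on workers."""
    queued = redis.Redis.from_url(broker_url).llen("celery")
    active_by_worker = app.control.inspect().active() or {}
    active = sum(len(tasks) for tasks in active_by_worker.values())
    return {"queued": queued, "active": active}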

Find a request in logs by request ID

Backend includes X-Request-Id header in responses. Grep for it across logs:

kubectl logs -n greekmanage -l app=backend | grep "abc123-..."

(Add request_id to every log line via Django middleware — see apps/common/middleware.py.)
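
If the middleware doesn't already do this, the usual pattern is a contextvar set per request plus a logging filter that stamps it onto every record. A sketch (class names are illustrative):

# apps/common/middleware.py (sketch)
import contextvars
import logging
import uuid

_request_id = contextvars.ContextVar("request_id", default="-")

class RequestIdMiddleware:
    """Read X-Request-Id (or mint one), echo it back, and expose it to logging."""
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        rid = request.headers.get("X-Request-Id", str(uuid.uuid4()))
        _request_id.set(rid)
        response = self.get_response(request)
        response["X-Request-Id"] = rid
        return response

class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""
    def filter(self, record):
        record.request_id = _request_id.get()
        return True

Register the filter under LOGGING["filters"], attach it to the console handler, and include request_id in the JSON formatter fields so every log line carries it.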

Roadmap (rough priority)

  1. Sentry integration — biggest immediate value
  2. Uptime monitoring — cheap, fast
  3. Prometheus + Grafana — for capacity + perf
  4. Log aggregation — when scale demands it
  5. Distributed tracing — when LLM / external API debugging gets hard
  6. Frontend RUM — when UX perf becomes a focus

Tickets to track each are in the issue tracker (observability label).