SEV2 — Pennington Mutual settlement batch delayed. @here investigating. ~500k in payments stuck. started seeing errors around 09:45. pulling logs now
on it. what's the error signature?
settlement_processor throwing `DeadlockDetectedException` on the batch_jobs table. frequency is increasing
pulling query stats from pg_stat_activity now. give me 5 min
checking connection pool metrics. infra dashboard is showing pool utilisation at 94% on settlement-db-01
heads up — Pennington relationship manager has already emailed. keeping them on hold until we have something concrete
found it. there's a long-running transaction that's been open since 09:43 — looks like the batch reconciliation job started before the previous one finished. two jobs holding locks on the same rows in `settlement_entries`
so the batch scheduler fired the job twice?
looks that way. the idempotency check is supposed to prevent that but it's not working if the previous job is still in-flight. the lock contention is cascading — new transactions are piling up waiting for the deadlock to resolve
connection pool just hit 100%. new connections are being rejected. this is going to start affecting other clients soon not just Pennington
ok priority is to kill the stuck transaction and drain the queue. lena, can you identify the pid?
pid 47832. it's been idle-in-transaction for 41 minutes. safe to terminate
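for anyone repeating this lookup later, a rough sketch of how a stuck session like this could be found. It assumes the standard Postgres `pg_stat_activity` columns (`pid`, `state`, `xact_start`); the `stuck_sessions` helper, the 30-minute threshold and the sample rows other than pid 47832 are illustrative, not what was actually run during the incident.

```python
from datetime import datetime, timedelta, timezone

# Query one might run by hand; the filtering below mirrors it on fetched rows.
IDLE_IN_TXN_SQL = """
SELECT pid, state, xact_start
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
"""


def stuck_sessions(rows, now, max_age=timedelta(minutes=30)):
    """Return pids idle in a transaction for longer than max_age, oldest first."""
    stuck = [
        r for r in rows
        if r["state"] == "idle in transaction" and now - r["xact_start"] > max_age
    ]
    return [r["pid"] for r in sorted(stuck, key=lambda r: r["xact_start"])]


# Illustrative snapshot: only pid 47832 and its 09:43 start come from the thread.
utc = timezone.utc
now = datetime(2026, 3, 23, 10, 24, tzinfo=utc)
rows = [
    {"pid": 47832, "state": "idle in transaction",
     "xact_start": datetime(2026, 3, 23, 9, 43, tzinfo=utc)},
    {"pid": 50001, "state": "active",
     "xact_start": datetime(2026, 3, 23, 10, 23, tzinfo=utc)},
    {"pid": 50002, "state": "idle in transaction",
     "xact_start": datetime(2026, 3, 23, 10, 20, tzinfo=utc)},
]
result = stuck_sessions(rows, now)
print(result)
```

once a pid is confirmed safe to kill, `SELECT pg_terminate_backend(<pid>);` is the usual Postgres call for ending that session.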
go ahead and kill it. ryan — get ready to restart the connection pool manager once lena clears the lock
terminated. monitoring pg_locks now to confirm cascade clears
pool utilisation dropping. 87%... 72%... ok it's draining
settlement processor is picking up again. seeing successful batch completions in the logs
good. how many transactions are still in the backlog?
roughly 340 still queued. at normal throughput we're looking at ~90 minutes to clear. no further errors in the last 5 minutes
update for priya to pass to Pennington: root cause identified and resolved. settlement processing resumed. full backlog expected to clear by 12:15 UTC. no data loss
restarting the batch scheduler with the duplicate-job guard enabled. should have been on by default tbh — that config flag is a trap
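a minimal sketch of what a duplicate-job guard does, for context. This is not the scheduler's actual code: `DuplicateJobGuard`, `try_start` and `finish` are hypothetical names, and it only covers a single process. A scheduler running across several hosts would need a shared lock instead (for example a Postgres advisory lock), which is an assumption about how one might do it, not how ours works.

```python
import threading


class DuplicateJobGuard:
    """Refuses to start a job while a previous run of the same job is in flight."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_start(self, job_name):
        """Return True and mark the job running, or False if it already is."""
        with self._lock:
            if job_name in self._in_flight:
                return False
            self._in_flight.add(job_name)
            return True

    def finish(self, job_name):
        """Mark the job as done so the next scheduled run may start."""
        with self._lock:
            self._in_flight.discard(job_name)


guard = DuplicateJobGuard()
first = guard.try_start("batch_reconciliation")   # nothing running, so allowed
second = guard.try_start("batch_reconciliation")  # previous run still in flight
guard.finish("batch_reconciliation")
third = guard.try_start("batch_reconciliation")   # allowed again after finish
print(first, second, third)
```

the point is that the check and the "mark as running" step happen under one lock, so two triggers arriving close together cannot both pass.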
backlog is 60% cleared. all transactions processing cleanly
backlog fully cleared. 09:43 to ~11:40, total duration about 2 hours. all Pennington settlements confirmed. marking SEV2 resolved
nice work everyone. tom, can you write up the postmortem doc and pin the resolution here?
RESOLVED: SEV2 Pennington Settlement Delay (2026-03-23)
Duration: 09:43–11:47 UTC (~2hr 4min)
Impact: ~500k in settlement payments delayed for Pennington Mutual. No data loss.
Root cause: Batch scheduler fired duplicate reconciliation job due to misconfigured idempotency guard. Second job acquired overlapping row locks causing deadlock cascade. Connection pool exhaustion secondary effect.
Resolution: Terminated hung transaction (pid 47832), drained connection pool, re-enabled duplicate-job guard in batch scheduler config.
Actions: Batch scheduler config reviewed and corrected. PR raised to enforce idempotency guard as non-optional.
Postmortem doc in Notion: [INC-2026-031]
Pennington confirmed all payments settled. they were understanding. good comms throughout
reminder to everyone: if connection pool utilisation goes above 80% you'll see it in Grafana dashboard infra-db-pool. I've set an alert at 85% now. should've done this sooner
weekly check — all systems nominal. settlement processing, KYC, and payment rails all green
all quiet this week. nothing to report
heads up — I'm seeing elevated error rates from Onfido on the KYC verification endpoint. rate limiting errors for Broadgate Financial onboarding flow. started about 20 minutes ago
how bad? are new KYC checks failing outright or just slowing down?
roughly 40% of requests returning 429. the rest are succeeding but with retries. Broadgate onboarding is partially degraded — applicants are seeing longer wait times and some are getting error screens
ok logging this as SEV3 KYC degradation. david-kimura FYI — Broadgate Financial onboarding affected, not a full outage but worth flagging to their account team
on it. notifying Broadgate account manager now. what's the fix path?
checking our Onfido rate limit tier. we're on 100 req/min and Broadgate's onboarding batch this morning is pushing 130+. they must have had a marketing campaign or something — volume spike. we need to either throttle our requests or upgrade the Onfido tier
can we add request queuing on our side to smooth it out within the 100/min limit?
yes — we have a basic rate limiter in the KYC service but it's not enabled by default. let me push a config change
rate limiter config deployed. KYC requests now smoothed to 90 req/min to give headroom. error rate dropping
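rough sketch of the smoothing idea, assuming simple even pacing (one slot every 60/90 seconds). The KYC service's real limiter may work differently; `PacedLimiter` and `FakeClock` are made-up names, and the fake clock is only there so the example runs without actually sleeping.

```python
import time


class PacedLimiter:
    """Spaces calls evenly so consecutive calls start at least 60/per_minute s apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_slot = None  # earliest time the next call may start

    def acquire(self):
        """Block until the next free slot; return the seconds waited."""
        now = self._clock()
        if self._next_slot is None or now >= self._next_slot:
            self._next_slot = now + self.interval
            return 0.0
        wait = self._next_slot - now
        self._sleep(wait)
        self._next_slot += self.interval
        return wait


class FakeClock:
    """Simulated clock so the demo finishes instantly instead of sleeping."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


fake = FakeClock()
limiter = PacedLimiter(per_minute=90, clock=fake.now, sleep=fake.sleep)
for _ in range(130):  # a burst the size of the Broadgate spike
    limiter.acquire()
elapsed = fake.t
print(f"130 requests released over {elapsed:.1f}s")
```

with these numbers a 130-request burst gets spread over roughly 86 seconds instead of hitting Onfido inside one minute, which is why the 429s stop at the cost of some extra wait for applicants.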
Onfido 429s have stopped. all KYC checks going through cleanly. marking resolved. duration ~30 min. low impact — no failed verifications, just delays
Broadgate account manager says their client noticed some slowness but no escalation. good outcome
I'll raise a ticket to get the KYC rate limiter enabled by default and to review our Onfido tier — current limit is too tight for growth
just did a manual check on alerting coverage and we have zero PagerDuty alerts fired in the last 2 weeks. zero. checked the integration and the API token expired on april 3rd. we've been flying blind
wait what. the token expired two weeks ago and nobody noticed? I thought you were handling the PD token rotation ryan
I thought YOU were. you set up the integration originally. I've never had visibility on the token expiry
I set it up 8 months ago, I haven't touched it since. there was no handover
well someone should have documented who owns it. this is exactly the kind of thing that falls through the gaps
ok let's pause on the blame thread. the immediate problem is we have no alerting right now. ryan, can you regenerate the token and restore the integration today?
yes. doing it now
once that's done I want to spend 20 minutes this afternoon mapping out all our critical integration tokens and who owns them. we'll add expiry reminders to the ops calendar and assign a named owner to each. tom and ryan — can you both join a quick call at 14:30?
token regenerated. PagerDuty integration is live. running a test alert now
test alert fired and received. we're back online
worth checking: were there any actual incidents in the last two weeks that we should have been paged for but weren't?
reviewed logs for the past 14 days. nothing that would have triggered a SEV1 or SEV2. we got lucky. the Onfido issue on Apr 3 was caught manually before it would have auto-paged anyway
Action items from 14:30 call:
All integration tokens documented in Notion ops runbook with owner and expiry date.
Calendar reminders set 30 days before each expiry.
PagerDuty token: ryan-kelly is named owner. Onfido token: lena-park. Stripe webhook secret: tom-brennan.
Monthly token audit added to ops checklist.
Thanks both for getting this sorted quickly
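one possible shape for the monthly token audit, as a sketch only. The owners are the ones named above; the expiry dates, the "today" date and the `tokens_needing_attention` helper are invented for illustration, since the real expiries live in the Notion runbook.

```python
from datetime import date, timedelta

REMINDER_LEAD = timedelta(days=30)  # reminders go out 30 days before expiry


def tokens_needing_attention(tokens, today):
    """Return (name, owner, days_left) for tokens inside the reminder window
    or already expired, most urgent first."""
    flagged = []
    for name, (owner, expiry) in tokens.items():
        days_left = (expiry - today).days
        if days_left <= REMINDER_LEAD.days:
            flagged.append((name, owner, days_left))
    return sorted(flagged, key=lambda item: item[2])


# Expiry dates below are placeholders, not the real ones.
tokens = {
    "PagerDuty token": ("ryan-kelly", date(2026, 5, 20)),
    "Onfido token": ("lena-park", date(2026, 9, 1)),
    "Stripe webhook secret": ("tom-brennan", date(2026, 4, 28)),
}
report = tokens_needing_attention(tokens, today=date(2026, 5, 1))
for name, owner, days_left in report:
    print(f"{name}: owner {owner}, {days_left} days left")
```

a negative `days_left` means the token has already expired, which is the case that bit us with PagerDuty.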
for the record: not a production incident but this was a real gap. we were lucky. good catch ryan even if the conversation was a bit spicy
yeah fair. good outcome
weekly check — all good. no incidents, alerting confirmed healthy
flagging for visibility: seeing intermittent >500ms response times on Vertex Capital API calls. not consistent but happening enough that I'm watching it. p99 latency over the last hour is 620ms, up from baseline ~180ms
what endpoints specifically?
/v2/payments/initiate and /v2/accounts/balance — both read and write paths affected. started around 10:40
pulling traces from Jaeger. will look at the full request path
my first guess is connection pool exhaustion under load. Vertex integration uses the shared HTTP client pool and there's been a batch job running since 10:30. checking
traces show the latency is in the outbound HTTP call to Vertex, not in our processing. so it's either their side or something in the network path
connection pool looks fine actually. utilisation at 45%, no queue buildup. so not that
checked Vertex status page — no incidents reported. going to reach out to their integration team
is this affecting live payments or just latency?
all requests completing successfully, just slow. no failures. Vertex Capital payments going through but taking longer. keeping an eye on it
I've put a 30-minute rolling average on the Vertex API latency in Grafana. if it spikes above 1s I'd want to know immediately
latency has come back down. p99 now 210ms, basically normal. vertex integration team has not responded yet. going to keep monitoring through EOD
keeping an eye on it overnight. if it spikes again we'll have more data to work with
update: saw another spike at 08:15 this morning, lasted about 12 minutes. p99 hit 780ms. then dropped back. pattern is intermittent — not load-correlated from our side
vertex integration team finally responded. they say they're not seeing anything on their end and their internal metrics look clean. they want us to send sample request IDs
sent them 15 request IDs from the two spike windows. waiting on response
I've been checking if there's a pattern — time of day, specific IP routing, anything. the 10:40 spike yesterday and 08:15 today don't share a load profile. could be something in their CDN or load balancer doing rolling restarts
is there a business impact we need to communicate to Vertex relationship team or are we ok to keep this at the technical level for now?
no failures so far, just latency. I'd say keep it technical for now but if we see failures or it persists past this week we escalate
vertex responded to the request IDs. they can reproduce slightly elevated latency in those windows on their trace system but have no explanation yet. escalating internally on their side
third spike this morning. 09:05–09:18 UTC. p99 peaked at 910ms. still no failures. this is getting annoying
yeah. three spikes, no root cause, vertex can't explain it. I'm going to document what we know and log this as an open investigation. not enough to call it a formal incident but it needs tracking
agreed. I'll add a spike detector alert — if p99 goes above 600ms for more than 5 minutes, it pages. at least we'll have good data capture going forward
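sketch of the alert condition described above (p99 above 600ms sustained for more than 5 minutes). The real alert would be a Grafana rule; this just spells out the logic in code. `should_page` and the sample series are illustrative, with one sample per minute assumed.

```python
def should_page(samples, threshold_ms=600, min_duration_s=300):
    """samples: (timestamp_seconds, p99_ms) pairs in time order.

    Returns True once p99 has stayed above threshold_ms continuously
    for more than min_duration_s.
    """
    breach_start = None
    for ts, p99 in samples:
        if p99 > threshold_ms:
            if breach_start is None:
                breach_start = ts  # first sample of this breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # back under threshold, reset the timer
    return False


# A 13-minute spike at 910ms, similar in size to the 09:05-09:18 one.
long_spike = [(i * 60, 910) for i in range(13)]
# A short blip that recovers within a few minutes.
short_blip = [(0, 180), (60, 650), (120, 700), (180, 640), (240, 190)]

print(should_page(long_spike), should_page(short_blip))
```

requiring the breach to be continuous keeps one-off slow samples from paging anyone, while a spike the length of the ones seen so far would trip it.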
also worth considering: do we have a circuit breaker on the Vertex client? if this degrades into actual failures we want to fail fast rather than queue up
we don't. I'll raise a ticket. it's a good call regardless of this specific issue
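for the ticket, a minimal sketch of the circuit breaker pattern being asked for. It assumes a simple consecutive-failure count and a fixed cool-off; the class name, thresholds and the stand-in functions are hypothetical and nothing here reflects the actual Vertex client.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-off elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Re-open on a failed trial call, or open once the threshold is hit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a controllable clock so no real waiting is needed.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: fake_now[0])


def failing_call():
    raise RuntimeError("simulated upstream failure")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except RuntimeError:
        pass

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True

fake_now[0] = 11.0  # past the cool-off, the trial call is allowed
recovered = breaker.call(lambda: "ok")
print(rejected, recovered)
```

while the circuit is open, callers get an immediate error instead of queueing behind a slow or failing upstream, which is the fail-fast behaviour mentioned above.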
vertex integration team just sent a follow-up — they think it may be related to a BGP route change their network team made on Apr 26. they're reverting the change in their staging environment to test. no ETA on fix
BGP route change causing intermittent latency spikes — that would fit the pattern. not load correlated, short duration, happens at irregular intervals. makes sense
monitoring through end of week. if vertex confirms the fix I'll update here and close out the investigation.
no further spikes today. may be settling. keeping the monitor running
all quiet since wednesday. vertex confirmed BGP route revert deployed on their side thursday evening. p99 has been steady at ~190ms for 36 hours. closing out the investigation — root cause: third-party BGP routing change. no data loss, no payment failures
circuit breaker ticket is in backlog: PLAT-4421. spike alert is live in Grafana
weekly check complete. all integrations healthy, PD alerting confirmed active, no open incidents
all systems nominal. nothing to report