The quiet Rust service that answers the phone, remembers every customer, and never sleeps.
Where we're going
The tour / 4 stops
1 The lay of the land
What it is, the stack, how it ships, what it actually does all day.
2 Who are you?
Turning a mess of human data into one profile per person.
3 Scale
Doing that matching across hundreds of thousands of records without melting the database.
4 Reliability
Why a redeploy doesn't drop a live phone call.
Stops 2–4 are the real meat. Stop 1 is so they make sense.
Stop 1 · The lay of the land
What even is the CRM?
It's the comms brain for the business.
☎️ Answers & routes calls - a customer rings, it finds the right available staff member and bridges them in a conference.
💬 Handles SMS - inbound/outbound texting, tied to the right person.
🗂 Builds one profile per customer - from orders, bookings & appointments scattered across 4 systems.
🔌 Glues vendors together - Front, Twilio, SimplyBookMe, Shopify, Salestools.
The staff-facing phone client (desktop app) talks to it over a WebSocket. When your screen lights up with "Sarah is calling - here's her last 3 orders", that's this service.
/health check every 10s; the old version only steps down once the new one is healthy
Image tagged by git commit → trivial rollback
Used to run on Fly.io - moved off, it was too unreliable for always-on calls
Start-first + health checks = a bad deploy never takes traffic. More on why that matters at Stop 4.
Stop 1 · The headline feature
A call, start to finish
Ringing: customer calls in
→
FindingStaff: who's free + preferred?
→
DialingStaff: ring a person
→
Conference: bridge both legs
→
Voicemail: nobody home
Calls are modelled as an explicit state machine - every call is always in exactly one known state. Twilio webhooks drive transitions; the conference is how we bridge customer ↔ staff.
The compiler won't let us forget a state. Each call runs its own async task (call.run()) and receives actions over a channel - an actor. No shared mutable mess, no locks around a call.
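A minimal sketch of that shape, not the service's actual code. The state names come from the diagram above; `Completed`, the `Action` variants and the transition rules are assumptions made for illustration, and a blocking `std::sync::mpsc` channel stands in for the async task so the example needs no runtime.

```rust
use std::sync::mpsc::Receiver;

/// States from the diagram above; `Completed` is an assumed terminal state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallState {
    Ringing,
    FindingStaff,
    DialingStaff,
    Conference,
    Voicemail,
    Completed,
}

/// Hypothetical actions a call actor might receive over its channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    CallAccepted,
    StaffFound,
    NoStaffAvailable,
    StaffAnswered,
    StaffDidNotAnswer,
    HungUp,
}

/// Pure transition function. The outer `match` has no wildcard arm, so adding
/// a new `CallState` variant fails to compile until it is handled here.
pub fn next(state: CallState, action: Action) -> CallState {
    use Action::*;
    use CallState::*;
    match state {
        Ringing => match action {
            CallAccepted => FindingStaff,
            HungUp => Completed,
            _ => Ringing,
        },
        FindingStaff => match action {
            StaffFound => DialingStaff,
            NoStaffAvailable => Voicemail,
            HungUp => Completed,
            _ => FindingStaff,
        },
        DialingStaff => match action {
            StaffAnswered => Conference,
            StaffDidNotAnswer => FindingStaff,
            HungUp => Completed,
            _ => DialingStaff,
        },
        Conference | Voicemail => match action {
            HungUp => Completed,
            _ => state,
        },
        Completed => Completed,
    }
}

/// The actor loop: the call owns its state and only changes it in response
/// to actions arriving on its channel, so no lock is needed around a call.
pub fn run(actions: Receiver<Action>) -> CallState {
    let mut state = CallState::Ringing;
    for action in actions {
        state = next(state, action);
        if state == CallState::Completed {
            break;
        }
    }
    state
}
```

Keeping `next` pure means the transition table can be tested without a channel, a thread or a phone call.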
Deep dive · Stop 2
Who are you?
The hardest problem wasn't code. It was people.
Stop 2 · The mess
Four systems, one human
The same customer shows up as different rows in different tools, none of which agree:
Source | What it is | What's reliable
order | Shopify / POS purchases | email-ish, name-ish
sbm | SimplyBookMe bookings | name crammed in one text field
aj | AppointJet - the primary on a booking | richest source
aj_partner | the second person on a couple's booking | partial, often shares email
A couple books one appointment → that's two people we must keep apart, often sharing a single email address. 💍
Stop 2 · Cleaning
Step 1 - make it comparable
Before matching anything, every raw row is normalised:
📞 Phone → parsed to E.164 (+61…), AU default; junk ending in 000000 binned
🔤 Names → trimmed; curly ’ vs straight ' normalised so O'Brien matches
No email AND no phone? → dropped. Nothing to match on.
Then split by how much we know:
Full identity (2+ fields) → joins the main matching. Weak identity (1 lonely field, e.g. a bare email import) → set aside for later.
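The cleaning rules above can be sketched roughly as follows. This is illustrative only: the function names, the `Identity` enum and the length bounds are assumptions, and the phone handling is a toy stand-in for a real E.164 parser (it only understands a leading `+` or a national leading `0`).

```rust
/// Outcome of cleaning one raw row.
#[derive(Debug, PartialEq, Eq)]
pub enum Identity {
    Full,    // 2+ usable fields: joins the main matching
    Weak,    // exactly 1 usable field: held back for later
    Dropped, // no email and no phone: nothing to match on
}

/// Rough AU-default E.164 normaliser (a sketch, not a full phone parser).
pub fn normalise_phone(raw: &str) -> Option<String> {
    let digits: String = raw.chars().filter(|c| c.is_ascii_digit()).collect();
    let e164 = if raw.trim_start().starts_with('+') {
        format!("+{}", digits)
    } else if let Some(rest) = digits.strip_prefix('0') {
        format!("+61{}", rest) // national format: assume Australia
    } else {
        return None; // anything else is out of scope for this sketch
    };
    // Bin placeholder numbers and implausible lengths (E.164 allows at most 15 digits).
    if e164.ends_with("000000") || e164.len() < 9 || e164.len() > 16 {
        return None;
    }
    Some(e164)
}

/// Trim and replace the curly apostrophe so O’Brien and O'Brien compare equal.
pub fn normalise_name(raw: &str) -> String {
    raw.trim().replace('\u{2019}', "'")
}

/// Split rows by how much we know about them.
pub fn classify(
    email: Option<&str>,
    phone: Option<&str>,
    first: Option<&str>,
    last: Option<&str>,
) -> Identity {
    let has = |f: Option<&str>| f.map_or(false, |s| !s.trim().is_empty());
    if !has(email) && !has(phone) {
        return Identity::Dropped;
    }
    let known = [email, phone, first, last].iter().filter(|f| has(**f)).count();
    if known >= 2 {
        Identity::Full
    } else {
        Identity::Weak
    }
}
```

In this sketch a row with a name but no email or phone is dropped before the field count is even taken, which mirrors the order of the rules on the slide.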
Stop 2 · Scoring
Step 2 - weigh the evidence
We don't use rigid rules. We score with Fellegi–Sunter - every field is evidence for or against "same person", summed into a probability.
Field | Exact match | Mismatch | Why
📞 Phone | +10.81 | −4.32 | hard to fake
📧 Email | +10.81 | −4.32 | strong
Last name | +10.64 | −4.32 | surnames rarely collide
First name | +6.12 | −3.25 | "John" collides constantly
Sum the weights → logistic → probability. Merge only above 0.99. The system would rather leave a duplicate than fuse two people.
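A sketch of "sum → logistic → probability" using the weights from the table. Three things here are assumptions, not taken from the source: that the weights are base-2 log Bayes factors, that the prior is even (no extra prior term), and that a field missing on either side contributes nothing.

```rust
/// (match, mismatch) weights copied from the table above.
const PHONE: (f64, f64) = (10.81, -4.32);
const EMAIL: (f64, f64) = (10.81, -4.32);
const LAST_NAME: (f64, f64) = (10.64, -4.32);
const FIRST_NAME: (f64, f64) = (6.12, -3.25);

/// Per field: `Some(true)` = agree, `Some(false)` = disagree,
/// `None` = missing on at least one side (assumed neutral).
#[derive(Debug, Clone, Copy, Default)]
pub struct Comparison {
    pub phone: Option<bool>,
    pub email: Option<bool>,
    pub last_name: Option<bool>,
    pub first_name: Option<bool>,
}

fn weight(agreement: Option<bool>, (hit, miss): (f64, f64)) -> f64 {
    match agreement {
        Some(true) => hit,
        Some(false) => miss,
        None => 0.0,
    }
}

/// Sum the evidence for and against "same person".
pub fn total_weight(c: &Comparison) -> f64 {
    weight(c.phone, PHONE)
        + weight(c.email, EMAIL)
        + weight(c.last_name, LAST_NAME)
        + weight(c.first_name, FIRST_NAME)
}

/// Logistic in base 2: odds = 2^total, probability = odds / (1 + odds).
/// Assumes base-2 weights and an even prior.
pub fn probability(total: f64) -> f64 {
    1.0 / (1.0 + (-total).exp2())
}

/// Merge only above 0.99.
pub fn should_merge(c: &Comparison) -> bool {
    probability(total_weight(c)) > 0.99
}
```

Under these assumptions a shared email with both names mismatching sums to +3.24, a probability of about 0.90, so the pair stays unmerged. A different log base or a prior term would shift the numbers but not the mechanism.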
Stop 2 · The "aha"
The case that breaks naive systems
jane & john @ shared inbox
Same email. Different first and last names.
Naive "match on email" → one merged blob. 💀
Ours: −3.25 and −4.32 on the names drag the shared email's +10.81 down to +3.24, short of the 0.99 merge bar → kept as two people. ✅
a bare email import
One field, no name, no phone.
Can't join main matching (too little info).
Ours: held back, then attached to whichever existing profile already has that exact email on a real record.
A shared email is a hint, not a verdict.
Stop 2 · The whole pipeline
From four messy sources to one profile
flowchart LR
O[("orders")] --> ING
S[("SimplyBookMe")] --> ING
A[("AppointJet")] --> ING
P[("AJ partner")] --> ING
ING["ingest"] --> CL["clean and normalise"]
CL -->|"2+ fields"| F[["full identity"]]
CL -->|"1 field"| W[["weak, held back"]]
F --> M["block, score, DSU union: prob over 0.99"]
M --> PR["build profiles: consensus name / email / phone"]
PR --> RE["re-match existing: respect rejections"]
W -.-> AT["attach weak to best profile"]
PR --> AT
RE --> CU["safety nets: orphan cleanup, un-merge over-merged"]
AT --> CU
classDef src fill:#1f3a29,stroke:#D3E6D9,color:#fefaf5
classDef proc fill:#173d24,stroke:#00BB33,color:#fefaf5
classDef sink fill:#3a3026,stroke:#CDA17F,color:#fefaf5
class O,S,A,P src
class ING,CL,M,RE,AT proc
class PR,CU sink
Continuous: new data flows in, profiles split & merge, "no"s are cached, orphans are swept. How it all stays fast at our size is the next stop. Full write-up: client_sync/MATCHING.md.
Stop 2 · Takeaway
What the data taught us
Garbage is the default. Half the engine is just making rows comparable.
Be probabilistic, be cautious. Score evidence; when unsure, don't merge.
Model the real world. Couples share emails - so a shared email can't be proof.
Make it reversible. Over-merged a profile? There's an un-merge safety valve.
The code is small. The judgement encoded in it is the product.
Learning · Stop 3
Scale
"Just compare everyone to everyone" is a trap.
Stop 3 · The problem
The numbers don't forgive you
Customer records: 100k+ (orders + bookings + appts)
Naive comparisons: ~5B+ (every pair, n²/2)
Sources that disagree: 4 (orders · SBM · AJ · partner)
To decide "are these two records the same person?" the obvious approach is compare every record to every other. With n records that's n(n−1)/2 pairs, roughly n²/2.
n²/2 on 100k records is ~5 billion comparisons. Per run. Every few minutes.
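The ~5 billion figure is the number of unordered pairs, n(n−1)/2. A one-line check (the function name is just for illustration):

```rust
/// Number of unordered pairs among `n` records: n(n-1)/2.
pub fn pair_count(n: u64) -> u64 {
    n * n.saturating_sub(1) / 2
}
```

`pair_count(100_000)` is 4,999,950,000, and doubling the record count roughly quadruples it.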
Stop 3 · The fix
Trick 1 - Blocking
Two records can't be the same person unless they share a contact channel - same email or same phone. So only ever compare records that land in the same bucket.
// build hash indexes once: contact → records
email_idx: HashMap<String, Vec<usize>>
phone_idx: HashMap<String, Vec<usize>>
// then only compare within a shared bucket
for A in records {
for B in email_idx[A.email] ∪ phone_idx[A.phone] {
if score(A, B) > 0.99 { dsu.union(A, B) } // ← match!
}
}
We go from "everyone × everyone" to "only people who already share a phone or email". The 5 billion collapses to a few comparisons per record.
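A runnable version of the same idea, as a sketch: `Record`, `Record::new` and `candidate_pairs` are hypothetical names, and scoring is left out so the block only shows which pairs blocking lets through. Each unordered pair is emitted once, and a record is never paired with itself.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical minimal record: only the two blocking keys.
pub struct Record {
    pub email: Option<String>,
    pub phone: Option<String>,
}

impl Record {
    pub fn new(email: Option<&str>, phone: Option<&str>) -> Self {
        Self {
            email: email.map(str::to_owned),
            phone: phone.map(str::to_owned),
        }
    }
}

/// Every unordered pair (a < b) of record indices that share an email or a phone.
pub fn candidate_pairs(records: &[Record]) -> BTreeSet<(usize, usize)> {
    // Build hash indexes once: contact → record indices.
    let mut email_idx: HashMap<&str, Vec<usize>> = HashMap::new();
    let mut phone_idx: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, r) in records.iter().enumerate() {
        if let Some(e) = r.email.as_deref() {
            email_idx.entry(e).or_default().push(i);
        }
        if let Some(p) = r.phone.as_deref() {
            phone_idx.entry(p).or_default().push(i);
        }
    }
    // Only pair up records that landed in the same bucket.
    let mut pairs = BTreeSet::new();
    for bucket in email_idx.values().chain(phone_idx.values()) {
        for (k, &a) in bucket.iter().enumerate() {
            for &b in &bucket[k + 1..] {
                pairs.insert((a, b)); // indices were pushed in ascending order, so a < b
            }
        }
    }
    pairs
}
```

With four records where 0 and 1 share an email and 1 and 2 share a phone, this yields two candidate pairs instead of the six a full comparison would need.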
Stop 3 · A spanner in the works
Then we broke our own fix
We started using the order system internally to prep ready-to-ship rings. Every one of those orders booked under a single internal account: one email, one phone.
real contact: 2 to 3 records
real contact: 1 to 4 records
internal account: thousands of orders, all one bucket
Blocking only helps if contacts are spread out. One bucket holding thousands of records means comparing every pair inside it: O(m²) all over again. We were right back to billions of comparisons on a single run.
A blocking key is only as good as its worst bucket.
Stop 3 · The rework
Trick 2 - Union-Find (DSU)
Matches are transitive: if A and B match, and B and C match, all three are one person, even if A and C share nothing directly.
A: name + phone
↔
B: phone + email
↔
C: email + name
A Disjoint-Set Union groups them in near-constant time per merge, and skips any pair already in the same component. This is what let us drop the explicit edge graph (build every link, then traverse it): union-find never has to materialise a billion-edge graph.
A ↔ B ↔ C ⟹ one profile, found cheaply.
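For reference, a textbook union-find with path halving and union by size. This is the standard structure, not necessarily how the service implements it; the `Dsu` name and method signatures are assumptions.

```rust
/// Disjoint-Set Union over record indices 0..n.
pub struct Dsu {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl Dsu {
    /// Every record starts as its own component.
    pub fn new(n: usize) -> Self {
        Self {
            parent: (0..n).collect(),
            size: vec![1; n],
        }
    }

    /// Root of x's component, shortening the path as it walks (path halving).
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the components of a and b. Returns false if they were already
    /// together, which is the cue to skip scoring that pair at all.
    pub fn union(&mut self, a: usize, b: usize) -> bool {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb);
        }
        self.parent[rb] = ra; // hang the smaller tree under the larger
        self.size[ra] += self.size[rb];
        true
    }

    pub fn same(&mut self, a: usize, b: usize) -> bool {
        self.find(a) == self.find(b)
    }
}
```

After `union(0, 1)` and `union(1, 2)`, `same(0, 2)` is true even though records 0 and 2 were never compared directly.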
Stop 3 · The rework
Trick 3 - Remember your "no"s
The sync runs continuously, so that giant internal bucket costs the same expensive comparisons on every single run. Most pairs that could match (share a contact) actually don't. Re-deciding that forever is wasted work.
Without
Re-compare the same household pair forever. Same answer. Same cost. Every run.
With a rejection cache
A "no" is written to profile_merge_rejection. We skip that pair - until new evidence arrives for one of them, then we re-check.
Caching the negative result is as valuable as computing the positive one.
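One way a negative cache with invalidation could look, sketched in memory. The `profile_merge_rejection` table is real per the slide; everything else is assumed, including the idea that each profile carries an `evidence_version` counter that is bumped whenever new data arrives for it.

```rust
use std::collections::HashMap;

/// Hypothetical handle: a profile id plus a counter bumped on new evidence.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ProfileRef {
    pub id: u32,
    pub evidence_version: u64,
}

/// In-memory stand-in for the `profile_merge_rejection` table.
#[derive(Default)]
pub struct RejectionCache {
    /// (lower id, higher id) → evidence versions at the time of the "no".
    rejected: HashMap<(u32, u32), (u64, u64)>,
}

/// Order the pair by id so (a, b) and (b, a) hit the same entry.
fn ordered(a: ProfileRef, b: ProfileRef) -> (ProfileRef, ProfileRef) {
    if a.id <= b.id {
        (a, b)
    } else {
        (b, a)
    }
}

impl RejectionCache {
    /// Remember that this pair was scored and did not merge.
    pub fn record_no(&mut self, a: ProfileRef, b: ProfileRef) {
        let (lo, hi) = ordered(a, b);
        self.rejected
            .insert((lo.id, hi.id), (lo.evidence_version, hi.evidence_version));
    }

    /// True only if the pair was rejected before and neither side has
    /// gained evidence since; otherwise the pair must be re-scored.
    pub fn can_skip(&self, a: ProfileRef, b: ProfileRef) -> bool {
        let (lo, hi) = ordered(a, b);
        self.rejected.get(&(lo.id, hi.id))
            == Some(&(lo.evidence_version, hi.evidence_version))
    }
}
```

The invalidation rule lives in the comparison: a stale version on either side makes `can_skip` false, so nothing has to be deleted when new data arrives.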
Stop 3 · Takeaway
What scale taught us
▣ Block first: narrow the candidates before you compute anything
⋃ Right structure: DSU turns transitive merges into near-O(1) per merge
⊘ Cache the "no": with an invalidation rule, not forever
A correct algorithm that's O(n²) is a wrong algorithm at our size.
Learning · Stop 4
Reliability
You can't ask a customer mid-call to "hold while we redeploy".
Stop 4 · The stakes
Live calls vs. shipping code
We deploy whenever. But at any moment there might be live phone calls in progress. A process restart can't just vaporise them.
❌ Naive: call state lives in memory → redeploy → call drops, customer hears silence.
✅ Ours: every call's state is persisted to Postgres as it transitions.
The in-memory actor is a cache of a row in the phone_call table - not the source of truth.
Stop 4 · The recovery
resume_state() - the comeback
On boot, before taking traffic, the service rebuilds reality:
1 · Load: all calls where completed = false
→
2 · Ask Twilio: is this call actually still alive?
→
3 · Reconcile: resume the actor, or clean it up
The magic is step 2: we don't trust our own DB blindly. We ask Twilio - the real source of truth for telephony - whether each call is still alive. Dead ones get cleaned up; live ones get their actor + state machine rebuilt and nudged back on track.
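The three steps as a sketch. None of these names are from the real code: `Telephony` is a hypothetical trait standing in for a lookup against Twilio, `StoredCall` for a row of the `phone_call` table, and `reconcile_on_boot` only decides what to do with each call rather than doing it.

```rust
/// Hypothetical: asks the telephony provider whether a call still exists.
pub trait Telephony {
    fn is_live(&self, call_sid: &str) -> bool;
}

/// Hypothetical slice of a `phone_call` row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredCall {
    pub sid: String,
    pub completed: bool,
}

/// What to do with each unfinished call found at boot.
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    Resume(String),  // rebuild the actor and state machine
    CleanUp(String), // the provider says it's gone: mark it finished
}

/// 1. load unfinished calls, 2. ask the provider, 3. reconcile.
pub fn reconcile_on_boot(rows: &[StoredCall], provider: &dyn Telephony) -> Vec<Recovery> {
    rows.iter()
        .filter(|call| !call.completed)
        .map(|call| {
            if provider.is_live(&call.sid) {
                Recovery::Resume(call.sid.clone())
            } else {
                Recovery::CleanUp(call.sid.clone())
            }
        })
        .collect()
}

/// Test double: a fixed list of call ids the "provider" considers live.
pub struct FakeTelephony {
    pub live: Vec<String>,
}

impl Telephony for FakeTelephony {
    fn is_live(&self, call_sid: &str) -> bool {
        self.live.iter().any(|s| s.as_str() == call_sid)
    }
}
```

Putting the provider behind a trait is what makes the boot path testable without placing a phone call.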
Stop 4 · Reliability has consequences
Slow isn't slow. It's broken.
Lean on someone else's platform and their limits become your correctness bugs. Being late here doesn't degrade gracefully, it gets you switched off.
Front's 5-second axe
Take longer than 5s to answer a webhook and Front disables it, silently. We simply stop receiving comms.
Our average on that endpoint: 18ms. But network latency occasionally spikes, so FrontCheck watches for a disabled webhook and pings staff the instant it happens.
Twilio's no do-overs
Miss a Twilio webhook and a live call can drop. Miss a single status update and the call's state desyncs, ruining it.
There is no retry budget on a phone call. Late equals lost.
A 5-second budget we spend 18 milliseconds of.
Stop 4 · The supporting cast
Reliability is layered
Deploys can't hurt you
Start-first rollout: new version proves /health before the old one steps down
Auto-rollback on failure
No window where zero healthy instances serve traffic
Syncs heal themselves
Front comms sync is cursor-based - resumes from the last event, never double-counts
Respects API rate limits (parses "retry in N ms" and waits)
FrontCheck pings staff if Front ever disables our webhook
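The rate-limit bullet could be handled with something like the following. The exact wording of the API's message is an assumption here (only the "retry in N ms" fragment is from the slide), and `parse_retry_delay` is a hypothetical helper.

```rust
use std::time::Duration;

/// Pull N out of a message containing "retry in N ms".
/// Returns None if the hint is missing or uses another unit.
pub fn parse_retry_delay(message: &str) -> Option<Duration> {
    let lower = message.to_ascii_lowercase();
    // Everything after the first "retry in".
    let rest = lower.split("retry in").nth(1)?.trim_start();
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    let unit = rest[digits.len()..].trim_start();
    if !unit.starts_with("ms") {
        return None;
    }
    digits.parse::<u64>().ok().map(Duration::from_millis)
}
```

A sync loop would sleep for the returned duration before retrying, and fall back to its own backoff when the function returns `None`.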
Assume the process will die. Make restart boring.
Stop 4 · The receipts
Does it actually work?
Uptime · 2026 YTD: 99.998%
Total downtime: 240 sec (~4 minutes since Jan 1)
Availability tier: 4 nines+
Avg response · across the board: <100 ms
Heaviest endpoint · recording processing: ~6 s (still beats our other tools' averages)
240 seconds down all year, and we answer in under 100ms.
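The uptime and downtime figures can be cross-checked against each other. A small sketch of the arithmetic (function names are illustrative):

```rust
/// Uptime as a percentage, given seconds down and seconds elapsed.
pub fn uptime_percent(down_secs: f64, elapsed_secs: f64) -> f64 {
    100.0 * (1.0 - down_secs / elapsed_secs)
}

/// How many days of measurement a given downtime and uptime percentage imply.
pub fn implied_elapsed_days(down_secs: f64, uptime_pct: f64) -> f64 {
    down_secs / (1.0 - uptime_pct / 100.0) / 86_400.0
}
```

240 seconds of downtime at 99.998% implies about 12 million seconds measured, roughly 139 days, which fits a year-to-date figure.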
Wrapping up
Three lessons, one service
Messy humans
Real data is the hard part. Score evidence, stay cautious, model couples.
Scale
Don't out-compute a bad shape. Block, pick the right structure, cache the "no".
Reliability
Truth lives in the DB. Reconcile with reality on boot. Make restart boring.
It answers the phone. It remembers the customer. It survives a redeploy. That's the job.
That's the tour ◆
Questions?
Pick a stop and we'll go deeper - calls, matching, deploys, whatever.
Code: crm/src/controllers/ · Deep dive: client_sync/MATCHING.md