

The thing most “Arabic AI” gets wrong
Open most “Arabic AI” products in 2026 and you'll find an English product with Arabic strings. The buttons are right-to-left. The fonts work. The model speaks back in respectable Modern Standard Arabic. Underneath, the assumptions about how people decide, defer, document, and disagree are entirely Western.
That gap — between Arabic as a language and Saudi as a way of working — is not a localization problem. It's a research problem, and it has to be solved before the product is built, not retrofitted at the end.
What “translation thinking” looks like
A few patterns we keep seeing:
Quick-approval flows that assume one signoff. In Saudi enterprises, approvals usually move through several layers — junior to senior, technical to executive, sometimes back through legal. A product that doesn't show that chain doesn't show what's actually happening.
Notifications scheduled to the wrong week. The work week runs Sunday to Thursday. Prayer times shape the day. Friday is the day of rest itself, not the eve of the weekend. A scheduler that doesn't know this schedules into voids.
Tone defaults that mistake direct for confident. In Saudi professional correspondence, the relationship — and the people around the email — are part of what's being communicated. A model trained to be concise reads as cold. A model trained on US Slack reads as rude.
Trust hierarchies that get flattened. Decisions referenced by name carry weight; decisions referenced by department often do not. AI summaries that reduce a named decision to “the engineering team decided X” lose information a Saudi reader would have used to act.
None of these break the product. They just make it feel like it was made somewhere else — because it was.
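To make the first pattern concrete, here is a minimal sketch of an approval chain that keeps every layer visible instead of collapsing it into one signoff. The class names, role labels, and the send-back behaviour are our own assumptions for illustration, not a description of any particular product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    RETURNED = "returned"  # sent back to an earlier layer, e.g. by legal


@dataclass
class Step:
    approver: str  # a named person, not just a department
    role: str      # hypothetical labels such as "technical" or "legal"
    status: Status = Status.PENDING


@dataclass
class ApprovalChain:
    steps: List[Step] = field(default_factory=list)

    def current(self) -> Optional[Step]:
        """Return the first step that is not yet approved, or None if done."""
        for step in self.steps:
            if step.status is not Status.APPROVED:
                return step
        return None

    def approve(self) -> None:
        """Approve whichever step the request is currently waiting on."""
        step = self.current()
        if step is not None:
            step.status = Status.APPROVED

    def send_back(self, to_index: int) -> None:
        """Return the request to an earlier layer; that layer is marked
        returned and every later layer goes back to pending."""
        for step in self.steps[to_index:]:
            step.status = Status.PENDING
        self.steps[to_index].status = Status.RETURNED

    def render(self) -> str:
        """Show the whole chain so the reader sees where the request sits."""
        return " -> ".join(
            f"{s.approver} ({s.role}): {s.status.value}" for s in self.steps
        )
```

The design choice that matters is `render`: the interface shows the full chain, including a request that has travelled back through an earlier layer, rather than a single "approved" flag.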
What changes when you build for the actual context
The work moves from translation to construction. Some of what that means in practice:
Evaluation comes from a Saudi user, not a benchmark. Public NLP benchmarks for Arabic measure linguistic fluency. They don't measure whether the model knows when to use a deferential register, when to name a decision-maker, when to wait. We build evals from real Saudi interactions, with Saudi reviewers, before any feature ships.
Calendar and time logic are first-class, not afterthoughts. Hijri and Gregorian dates coexist. Prayer times depend on geolocation. The week starts on Sunday. An agent that has to be told this in every prompt is failing at its job.
Voice matters more than diction. Tone, register, and the choice of dialect or MSA are decisions, not defaults. A product talking to a senior Saudi attorney sounds different from one talking to a Saudi university student. The same model, two different products.
Hierarchy is preserved, not abstracted. When the agent summarises a decision, it tells you who made it, who they consulted, and who needs to know. Removing that information is not simplification; it's data loss.
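As an illustration of calendar logic treated as first-class, the sketch below moves a notification to the next moment inside a Sunday to Thursday work week. It is a simplified example under stated assumptions: the office hours are invented, prayer windows are passed in by the caller rather than computed from geolocation, and Hijri dates are not handled at all. The function name `next_send_time` is hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

# Saudi Arabia uses UTC+3 year-round, so a fixed offset is enough here.
AST = timezone(timedelta(hours=3), "AST")
# Python's weekday(): Monday=0 ... Sunday=6. Work week is Sunday to Thursday.
WORK_DAYS = {6, 0, 1, 2, 3}
# Assumed office hours, chosen only for the example.
DAY_START, DAY_END = time(8, 0), time(17, 0)


def next_send_time(requested, blackouts=()):
    """Return the earliest time at or after `requested` that falls on a
    work day, inside office hours, and outside every blackout window.

    `requested` must be a timezone-aware datetime. `blackouts` is a sequence
    of (start, end) aware datetimes, for example prayer windows computed
    elsewhere from the user's location.
    """
    t = requested.astimezone(AST)
    while True:
        if t.weekday() not in WORK_DAYS or t.time() >= DAY_END:
            # Weekend or after hours: jump to the next morning and re-check.
            t = datetime.combine(t.date() + timedelta(days=1), DAY_START, tzinfo=AST)
            continue
        if t.time() < DAY_START:
            # Before hours on a work day: wait for the day to start.
            t = datetime.combine(t.date(), DAY_START, tzinfo=AST)
            continue
        window_end = next((end for start, end in blackouts if start <= t < end), None)
        if window_end is not None:
            # Inside a blackout window: resume when it ends and re-check.
            t = window_end.astimezone(AST)
            continue
        return t


# A notification requested on a Friday morning lands on Sunday morning.
friday = datetime(2026, 1, 2, 10, 0, tzinfo=AST)
print(next_send_time(friday))
```

The point of the sketch is where the knowledge lives: the work week and the blackout windows sit in the scheduler itself, so nobody has to restate them in a prompt.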
Why this matters more in regulated markets
In an unregulated consumer product, cultural mismatch costs you adoption. In a regulated B2G or enterprise product, it costs you trust — and trust here, once spent, is expensive to recover. The institutions we work with do not give AI products a second first impression. The work of getting localization right is the work of being worth a first one.
What we want from the field
More work that starts with the user, not with the model. More products built with Saudi reviewers from the first sprint, not the last week. More benchmarks that test whether the agent fits the workflow, not just whether it reads the alphabet. We're building toward that, and we expect others will too.
If you're working on AI that has to land in Saudi context, or you're a researcher thinking about cultural localization as a real problem rather than an internationalization checklist, we'd like to compare notes.


