Paper and Dataset

Alexandria: English to Local Arabic Dialect Conversations at Scale

Alexandria is a community-built parallel dataset of English and dialectal Arabic conversations across high social-impact domains. Its content is human-translated and human-revised, and it is designed for culturally grounded Arabic MT. The conversations capture natural code-switching, male and female speaker-addressee gender directions, and diverse scenarios with specific personas and roles.

Coverage: Spans healthcare, education, legal and financial services, logistics, tourism, and other public-facing domains.
Structure: Preserves multi-turn conversational context with turn-level English-dialect alignment, enriched with persona metadata including roles and gender.
Linguistic Focus: Includes subdialects, naturally occurring code-switching across several countries, and registers adapted to specific conversation scenarios.

Dataset Summary

Built for culturally grounded Arabic MT

Conversations: 34,488
Parallel turns: 107,631
Participants: 55
Countries: 13
Domains: 11
Splits: Train / Development / Public Test / Private Test

Research Use of Alexandria

Supports model training, development, public test benchmarking, and private held-out testing for dialectal Arabic MT.
Provides multi-turn context suitable for turn-level, context-level, and conversation-level experimentation.
Covers subdialects within several Arab countries rather than only broad national labels.
Supports naturally occurring code-switching in conversational settings.
Preserves country, domain, role, and gender-direction metadata for controlled analysis.
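
As a rough illustration of the controlled analysis that this metadata allows, the sketch below counts and filters turn records by metadata fields. The field names (`country`, `domain`, `role`, `gender_direction`) and the sample records are assumptions made up for this example; they are not the released Alexandria schema, so check the dataset card for the actual column names.

```python
from collections import Counter

# Hypothetical turn records: field names and values are assumptions for
# illustration only, not the released Alexandria schema or real samples.
turns = [
    {"country": "EG", "domain": "Healthcare/Medical", "role": "doctor",
     "gender_direction": "female-to-male", "conversation_id": "c1"},
    {"country": "EG", "domain": "Healthcare/Medical", "role": "patient",
     "gender_direction": "male-to-female", "conversation_id": "c1"},
    {"country": "MA", "domain": "Tourism/Hospitality", "role": "guide",
     "gender_direction": "male-to-male", "conversation_id": "c2"},
]


def count_by(records, *keys):
    """Count records per combination of the given metadata keys."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def select(records, **conditions):
    """Keep only records whose metadata matches every condition."""
    return [r for r in records
            if all(r.get(k) == v for k, v in conditions.items())]


# Turn counts per (country, domain) pair.
by_country_domain = count_by(turns, "country", "domain")

# A controlled slice: Egyptian turns spoken by a woman to a man.
egyptian_f2m = select(turns, country="EG", gender_direction="female-to-male")
```

The same two helpers could be used to report a metric per slice, for example by scoring only the turns returned by `select` for each gender direction.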

Paper Summary

Highlights from the paper

Scale and Coverage

Alexandria comprises 34,488 conversations and 107,631 turns, spanning 13 Arab countries and 11 high-impact social domains. It features city-level grounding and rich multi-turn conversational context.

Metadata Depth

Alexandria’s conversations are enriched with metadata for city-level dialects, persona roles, and speaker-addressee gender, supporting both context- and gender-aware evaluation.

Revision Quality Signals

Alexandria was developed in two human-centric phases. In the first phase, participants translated the source data from English into specific Arabic dialects. In the second phase, the translations were reviewed and revised, and 31.6% of the turns were modified.

Gender and Code-Switching

Alexandria provides metadata for city-level varieties and for four gender directions: female-to-male, male-to-female, male-to-male, and female-to-female. Registers were adapted to each scenario and persona for stylistic accuracy, so evaluation can take context, natural code-switching, and social dynamics into account.

Evaluation Setup

The paper evaluates 24 Arabic-capable LLMs under turn-level, context-level, and conversation-level prompting with metadata-aware inputs, and reports spBLEU and chrF++ for automatic assessment.
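
To make the three prompting granularities concrete, here is a minimal sketch of how inputs might be assembled for the English-to-dialect direction. It is one plausible reading of "turn-level", "context-level", and "conversation-level", not the paper's actual prompt templates. The conversation, its field names, and the prompt wording are all hypothetical.

```python
# Hypothetical conversation: field names, texts, and metadata values are
# illustrative assumptions, not samples or schema from Alexandria.
conversation = {
    "country": "JO",
    "domain": "Healthcare/Medical",
    "turns": [
        {"role": "patient", "gender_direction": "female-to-male",
         "english": "I have had a headache since yesterday."},
        {"role": "doctor", "gender_direction": "male-to-female",
         "english": "Did you take any medication?"},
        {"role": "patient", "gender_direction": "female-to-male",
         "english": "Only a painkiller this morning."},
    ],
}


def metadata_header(conv, turn):
    """Metadata line placed at the top of a per-turn input."""
    return (f"Country: {conv['country']} | Domain: {conv['domain']} | "
            f"Role: {turn['role']} | Gender: {turn['gender_direction']}")


def turn_level_inputs(conv):
    """One input per turn, with no surrounding context."""
    return [f"{metadata_header(conv, t)}\nTranslate: {t['english']}"
            for t in conv["turns"]]


def context_level_inputs(conv):
    """One input per turn, preceded by all earlier turns as context."""
    inputs = []
    for i, t in enumerate(conv["turns"]):
        context = "\n".join(p["english"] for p in conv["turns"][:i])
        inputs.append(f"{metadata_header(conv, t)}\n"
                      f"Context:\n{context}\nTranslate: {t['english']}")
    return inputs


def conversation_level_input(conv):
    """A single input that contains the whole conversation."""
    lines = [f"[{i + 1}] ({t['role']}, {t['gender_direction']}) {t['english']}"
             for i, t in enumerate(conv["turns"])]
    return (f"Country: {conv['country']} | Domain: {conv['domain']}\n"
            "Translate every turn:\n" + "\n".join(lines))
```

Under this reading, the turn-level and context-level settings produce one model call per turn, while the conversation-level setting produces one call per conversation, whose output would then need to be split back into turns before scoring.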

Main Benchmark Findings

Results show a strong directional asymmetry: dialect-to-English is easier than English-to-dialect. Human evaluation reports high gender adherence, while dialectness remains a major challenge for many systems.

Dataset Stats

Coverage and scale

107,631 total human-translated English-to-dialect turns
13 Arab countries
11 domains

Regional groups: Levant, Gulf, Nile, and Maghreb

Parallel turns per domain and country. Columns are labeled with two-letter country codes.

Region	Levant				Gulf			Nile		Maghreb
Domain	JO	LB	PS	SY	SA	OM	YE	EG	SD	LY	MA	MR	TN
Agriculture/Farming	825	1,140	1,770	931	1,162	915	529	583	163	231	570	970	481
Commerce/Transactions	750	1,004	1,595	749	1,020	650	579	506	201	160	445	757	401
Construction/Real Estate	859	995	1,761	861	1,161	974	696	660	225	271	574	673	485
Education/Academia	816	1,191	1,513	831	1,017	1,079	563	549	170	220	601	863	551
Energy/Resources	786	1,048	1,715	928	1,177	937	587	625	189	243	447	719	470
Everyday/Social	967	1,215	1,697	787	1,020	888	642	604	175	210	595	824	550
Healthcare/Medical	727	1,240	1,728	781	1,043	895	548	487	164	253	556	948	522
Legal/Financial	693	1,006	1,566	757	857	753	496	539	177	174	481	642	412
Logistics/Transport	842	1,020	1,512	950	1,234	842	629	646	189	187	593	877	515
Professional/Workplace	845	1,220	1,810	959	1,112	866	549	645	178	253	480	709	526
Tourism/Hospitality	720	1,161	1,596	884	1,004	815	608	608	190	216	567	878	460
Total	8,830	12,240	18,263	9,418	11,807	9,614	6,426	6,452	2,021	2,418	5,909	8,860	5,373
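
The per-country totals can be rolled up into the four regional groups with a few lines of code. The sketch below copies the values from the Total row and the grouping from the table header; the variable names are arbitrary choices for this example.

```python
# Per-country turn totals copied from the "Total" row of the table.
country_totals = {
    "JO": 8830, "LB": 12240, "PS": 18263, "SY": 9418,
    "SA": 11807, "OM": 9614, "YE": 6426,
    "EG": 6452, "SD": 2021,
    "LY": 2418, "MA": 5909, "MR": 8860, "TN": 5373,
}

# Regional grouping taken from the table's header row.
regions = {
    "Levant": ["JO", "LB", "PS", "SY"],
    "Gulf": ["SA", "OM", "YE"],
    "Nile": ["EG", "SD"],
    "Maghreb": ["LY", "MA", "MR", "TN"],
}

# Sum the country totals inside each region.
region_totals = {name: sum(country_totals[c] for c in codes)
                 for name, codes in regions.items()}
grand_total = sum(country_totals.values())

for name, total in region_totals.items():
    print(f"{name}: {total:,} turns ({total / grand_total:.1%})")
print(f"All regions: {grand_total:,} turns")
```

The regional sums come to 48,751 turns for Levant, 27,847 for Gulf, 8,473 for Nile, and 22,560 for Maghreb, which together match the reported 107,631 parallel turns.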

Resources

Alexandria Resources

Available now:

Paper: Read the full Alexandria research paper on arXiv (https://arxiv.org/abs/2601.13099).
Data: Browse the Alexandria dataset release on Hugging Face, including splits and hosted dataset files.
Code: Access the Alexandria project repository on GitHub for evaluation code, updates, and related project materials.

Authors

Alexandria team

Citation

How to cite Alexandria

If you use Alexandria or the associated methodologies, please cite the original paper using the following BibTeX entry.

@misc{mekki2026alexandriamultidomaindialectalarabic,
  title={Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs},
  author={Abdellah El Mekki and Samar M. Magdy and Houdaifa Atou and Ruwa AbuHweidi and Baraah Qawasmeh and Omer Nacar and Thikra Al-hibiri and Razan Saadie and Hamzah Alsayadi and Nadia Ghezaiel Hammouda and Alshima Alkhazimi and Aya Hamod and Al-Yas Al-Ghafri and Wesam El-Sayed and Asila Al sharji and Mohamad Ballout and Anas Belfathi and Karim Ghaddar and Serry Sibaee and Alaa Aoun and Areej Asiri and Lina Abureesh and Ahlam Bashiti and Majdal Yousef and Abdulaziz Hafiz and Yehdih Mohamed and Emira Hamedtou and Brakehe Brahim and Rahaf Alhamouri and Youssef Nafea and Aya El Aatar and Walid Al-Dhabyani and Emhemed Hamed and Sara Shatnawi and Fakhraddin Alwajih and Khalid Elkhidir and Ashwag Alasmari and Abdurrahman Gerrio and Omar Alshahri and AbdelRahim A. Elmadany and Ismail Berrada and Amir Azad Adli Alkathiri and Fadi A Zaraket and Mustafa Jarrar and Yahya Mohamed El Hadj and Hassan Alhuzali and Muhammad Abdul-Mageed},
  year={2026},
  eprint={2601.13099},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.13099},
}