Scale and Coverage
Alexandria comprises 34,488 conversations and 107,631 turns, spanning 13 Arab countries and 11 high-impact social domains. It features city-level grounding and rich multi-turn conversational context.
Paper and Dataset
Alexandria is a community-built parallel dataset of English and dialectal Arabic conversations across high social-impact domains. Its content is human-translated and human-revised, and it is designed for culturally grounded Arabic MT, natural code-switching, and gender-inclusive translation with both male and female speakers and addressees. The conversations cover diverse scenarios with specific personas and roles.
Paper Summary
Alexandria’s conversations are enriched with metadata for city-level dialects, persona roles, and speaker-addressee gender, supporting both context- and gender-aware evaluation.
The Alexandria Dataset was developed in two human-centric phases. In the first phase, participants translated the source data from English into specific Arabic dialects. This was followed by a review and revision phase, where 31.6% of the turns were modified.
Beyond city-level varieties, the metadata records the gender direction of each interaction: female-to-male, male-to-female, male-to-male, or female-to-female. Registers were adapted to each scenario and persona for stylistic accuracy, so that context-aware evaluation can reflect natural code-switching and social dynamics.
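The four gender directions are simply every speaker-addressee pairing of two genders. The sketch below enumerates them and counts turns per direction, which is one way a gender-aware evaluation could be sliced. The `Turn` record and its field names are hypothetical and do not reflect the released dataset schema; the sample values are placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

GENDERS = ("female", "male")

# Every speaker-to-addressee pairing: 2 x 2 = 4 directions.
GENDER_DIRECTIONS = [f"{s}-to-{a}" for s, a in product(GENDERS, GENDERS)]


@dataclass(frozen=True)
class Turn:
    """Hypothetical turn record; field names are assumptions, not the real schema."""
    country: str
    city: str
    domain: str
    speaker_gender: str
    addressee_gender: str
    english: str
    dialect: str

    @property
    def direction(self) -> str:
        return f"{self.speaker_gender}-to-{self.addressee_gender}"


def count_by_direction(turns):
    """Return a count for each of the four directions, including zeros."""
    counts = Counter(t.direction for t in turns)
    return {d: counts.get(d, 0) for d in GENDER_DIRECTIONS}


# Placeholder records for illustration only.
sample = [
    Turn("JO", "Amman", "Healthcare/Medical", "female", "male", "<english text>", "<dialect text>"),
    Turn("JO", "Amman", "Healthcare/Medical", "male", "female", "<english text>", "<dialect text>"),
    Turn("EG", "Cairo", "Everyday/Social", "female", "male", "<english text>", "<dialect text>"),
]
direction_counts = count_by_direction(sample)
print(direction_counts)
```

Grouping by `direction` before scoring would let per-direction metrics be reported side by side, rather than averaged into a single number.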
The paper evaluates 24 Arabic-capable LLMs under turn-level, context-level, and conversation-level prompting with metadata-aware inputs, and reports spBLEU and chrF++ for automatic assessment.
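To make the three prompting granularities concrete, here is a minimal sketch of how such prompts might be assembled. It assumes that turn-level means translating one turn in isolation, context-level adds the preceding turns, and conversation-level submits the whole conversation at once. The function names, metadata keys, prompt wording, and sample turns are all hypothetical and are not the prompts used in the paper.

```python
def metadata_header(meta):
    """Render metadata as 'key: value' lines placed at the top of the prompt."""
    return "\n".join(f"{key}: {value}" for key, value in meta.items())


def build_prompt(turns, index, level, meta):
    """Build a translation prompt for turns[index] at the given granularity.

    level is one of "turn", "context", or "conversation" (assumed definitions).
    """
    if level == "turn":
        # The target turn alone, with no conversational history.
        body = f"Translate this turn:\n{turns[index]}"
    elif level == "context":
        # The target turn plus all preceding turns as context.
        history = "\n".join(turns[:index]) or "(none)"
        body = f"Previous turns:\n{history}\n\nTranslate this turn:\n{turns[index]}"
    elif level == "conversation":
        # The full conversation in a single request; index is ignored.
        numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(turns))
        body = f"Translate every turn of this conversation:\n{numbered}"
    else:
        raise ValueError(f"unknown level: {level!r}")
    return f"{metadata_header(meta)}\n\n{body}"


# Illustrative placeholders, not dataset content.
meta = {
    "target_dialect_city": "Amman",
    "domain": "Healthcare/Medical",
    "gender_direction": "female-to-male",
}
turns = ["Is the clinic open tomorrow?", "Yes, from nine until four."]

turn_prompt = build_prompt(turns, 1, "turn", meta)
context_prompt = build_prompt(turns, 1, "context", meta)
conversation_prompt = build_prompt(turns, 1, "conversation", meta)
print(context_prompt)
```

Keeping the metadata header identical across the three levels means that any difference in output quality can be attributed to the amount of conversational context rather than to the metadata itself.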
Results show a strong directional asymmetry: dialect-to-English is easier than English-to-dialect. Human evaluation reports high gender adherence, while dialectness remains a major challenge for many systems.
Dataset Stats
107,631 total human-translated English-to-dialect turns
13 Arab countries
11 domains
4 regional groups: Levant, Gulf, Nile, and Maghreb
Columns are grouped by region: Levant (JO, LB, PS, SY), Gulf (SA, OM, YE), Nile (EG, SD), and Maghreb (LY, MA, MR, TN).

| Domain | JO | LB | PS | SY | SA | OM | YE | EG | SD | LY | MA | MR | TN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture/Farming | 825 | 1,140 | 1,770 | 931 | 1,162 | 915 | 529 | 583 | 163 | 231 | 570 | 970 | 481 |
| Commerce/Transactions | 750 | 1,004 | 1,595 | 749 | 1,020 | 650 | 579 | 506 | 201 | 160 | 445 | 757 | 401 |
| Construction/Real Estate | 859 | 995 | 1,761 | 861 | 1,161 | 974 | 696 | 660 | 225 | 271 | 574 | 673 | 485 |
| Education/Academia | 816 | 1,191 | 1,513 | 831 | 1,017 | 1,079 | 563 | 549 | 170 | 220 | 601 | 863 | 551 |
| Energy/Resources | 786 | 1,048 | 1,715 | 928 | 1,177 | 937 | 587 | 625 | 189 | 243 | 447 | 719 | 470 |
| Everyday/Social | 967 | 1,215 | 1,697 | 787 | 1,020 | 888 | 642 | 604 | 175 | 210 | 595 | 824 | 550 |
| Healthcare/Medical | 727 | 1,240 | 1,728 | 781 | 1,043 | 895 | 548 | 487 | 164 | 253 | 556 | 948 | 522 |
| Legal/Financial | 693 | 1,006 | 1,566 | 757 | 857 | 753 | 496 | 539 | 177 | 174 | 481 | 642 | 412 |
| Logistics/Transport | 842 | 1,020 | 1,512 | 950 | 1,234 | 842 | 629 | 646 | 189 | 187 | 593 | 877 | 515 |
| Professional/Workplace | 845 | 1,220 | 1,810 | 959 | 1,112 | 866 | 549 | 645 | 178 | 253 | 480 | 709 | 526 |
| Tourism/Hospitality | 720 | 1,161 | 1,596 | 884 | 1,004 | 815 | 608 | 608 | 190 | 216 | 567 | 878 | 460 |
| Total | 8,830 | 12,240 | 18,263 | 9,418 | 11,807 | 9,614 | 6,426 | 6,452 | 2,021 | 2,418 | 5,909 | 8,860 | 5,373 |
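
The totals in the table can be cross-checked with a few lines of Python. The sketch below copies the per-country figures from the Total row and sums them overall and per regional group. The mapping of country codes to regional groups follows the column order of the table header and should be treated as an assumption.

```python
# Per-country turn totals, copied from the "Total" row of the table.
COUNTRY_TOTALS = {
    "JO": 8830, "LB": 12240, "PS": 18263, "SY": 9418,
    "SA": 11807, "OM": 9614, "YE": 6426,
    "EG": 6452, "SD": 2021,
    "LY": 2418, "MA": 5909, "MR": 8860, "TN": 5373,
}

# Assumed grouping, inferred from the order of the table columns.
REGIONS = {
    "Levant": ["JO", "LB", "PS", "SY"],
    "Gulf": ["SA", "OM", "YE"],
    "Nile": ["EG", "SD"],
    "Maghreb": ["LY", "MA", "MR", "TN"],
}


def region_totals(country_totals, regions):
    """Sum per-country totals into one figure per regional group."""
    return {
        region: sum(country_totals[code] for code in codes)
        for region, codes in regions.items()
    }


grand_total = sum(COUNTRY_TOTALS.values())
by_region = region_totals(COUNTRY_TOTALS, REGIONS)

print(f"All countries: {grand_total}")
for region, total in by_region.items():
    print(f"{region}: {total} ({total / grand_total:.1%})")
```

The grand total comes to 107,631, matching the headline turn count reported for the dataset.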
Resources
Read the full Alexandria research paper on arXiv.
Browse the Alexandria dataset release on Hugging Face, including splits and hosted dataset files.
Access the Alexandria project repository on GitHub for evaluation code, updates, and related project materials.
Citation
@misc{mekki2026alexandriamultidomaindialectalarabic,
title={Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs},
author={Abdellah El Mekki and Samar M. Magdy and Houdaifa Atou and Ruwa AbuHweidi and Baraah Qawasmeh and Omer Nacar and Thikra Al-hibiri and Razan Saadie and Hamzah Alsayadi and Nadia Ghezaiel Hammouda and Alshima Alkhazimi and Aya Hamod and Al-Yas Al-Ghafri and Wesam El-Sayed and Asila Al sharji and Mohamad Ballout and Anas Belfathi and Karim Ghaddar and Serry Sibaee and Alaa Aoun and Areej Asiri and Lina Abureesh and Ahlam Bashiti and Majdal Yousef and Abdulaziz Hafiz and Yehdih Mohamed and Emira Hamedtou and Brakehe Brahim and Rahaf Alhamouri and Youssef Nafea and Aya El Aatar and Walid Al-Dhabyani and Emhemed Hamed and Sara Shatnawi and Fakhraddin Alwajih and Khalid Elkhidir and Ashwag Alasmari and Abdurrahman Gerrio and Omar Alshahri and AbdelRahim A. Elmadany and Ismail Berrada and Amir Azad Adli Alkathiri and Fadi A Zaraket and Mustafa Jarrar and Yahya Mohamed El Hadj and Hassan Alhuzali and Muhammad Abdul-Mageed},
year={2026},
eprint={2601.13099},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.13099},
}