Alexandria

Paper and Dataset

Alexandria: English to Local Arabic Dialect Conversations at Scale

Alexandria is a community-built parallel dataset of English and dialectal Arabic conversations across domains with high social impact. Its content is human-translated and human-revised, and it is designed for culturally grounded Arabic MT, natural code-switching, and inclusive coverage of gender directions (male and female), while capturing diverse scenarios with specific personas and roles.

Coverage
Spans healthcare, education, legal and financial services, logistics, tourism, and other public-facing domains.
Structure
Preserves multi-turn conversational context with turn-level English-dialect alignment, enriched with persona metadata including roles and gender.
Linguistic Focus
Includes subdialects, naturally occurring code-switching across several countries, and registers adapted to specific conversation scenarios.

Paper Summary

Highlights from the paper

Scale and Coverage

Alexandria comprises 34,488 conversations and 107,631 turns, spanning 13 Arab countries and 11 high-impact social domains. It features city-level grounding and rich multi-turn conversational context.

Metadata Depth

Alexandria’s conversations are enriched with metadata for city-level dialects, persona roles, and speaker-addressee gender, supporting both context- and gender-aware evaluation.
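As a rough illustration of how this metadata could sit alongside the aligned text, the sketch below models one conversation in Python. The class and field names (Turn, Conversation, speaker_role, and so on) and the sample values are hypothetical and are not taken from the released files; check the dataset card for the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Set

Gender = Literal["male", "female"]


@dataclass
class Turn:
    # All field names here are illustrative assumptions, not the release schema.
    speaker_role: str          # persona role of the speaker
    speaker_gender: Gender
    addressee_gender: Gender
    english: str               # English source turn
    dialect: str               # aligned dialectal Arabic translation


@dataclass
class Conversation:
    country: str               # country code, e.g. "EG"
    city: str                  # city-level dialect grounding
    domain: str                # one of the 11 domains
    turns: List[Turn] = field(default_factory=list)

    def gender_directions(self) -> Set[str]:
        """Return the speaker-to-addressee gender directions present."""
        return {f"{t.speaker_gender}-to-{t.addressee_gender}" for t in self.turns}


# Hypothetical sample; the dialect side is left as a placeholder on purpose.
sample = Conversation(country="EG", city="<city>", domain="Healthcare/Medical")
sample.turns.append(Turn("doctor", "female", "male", "How are you feeling today?", "<dialect translation>"))
sample.turns.append(Turn("patient", "male", "female", "Much better, thank you.", "<dialect translation>"))
print(sorted(sample.gender_directions()))
```

Keeping gender on the turn rather than on the conversation is a design choice in this sketch: it lets the direction change as the speaker changes within a dialogue.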

Revision Quality Signals

The Alexandria dataset was developed in two human-centric phases. In the first phase, participants translated the source data from English into specific Arabic dialects. In the second phase, the translations were reviewed and revised, and 31.6% of the turns were modified.
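To put the 31.6% figure in absolute terms, the short calculation below applies it to the 107,631 turns reported for the dataset. It assumes the percentage is computed over all turns and is rounded to one decimal place, so the result is an estimate with a small range, not a count reported in the paper.

```python
TOTAL_TURNS = 107_631   # total turns reported for the dataset
REVISED_SHARE = 0.316   # share of turns modified during revision


def revised_turn_estimate(total: int, share: float, half_width: float = 0.0005):
    """Estimate modified turns, with bounds implied by one-decimal rounding."""
    low = round(total * (share - half_width))
    mid = round(total * share)
    high = round(total * (share + half_width))
    return low, mid, high


low, mid, high = revised_turn_estimate(TOTAL_TURNS, REVISED_SHARE)
print(f"about {mid:,} modified turns (between {low:,} and {high:,})")
```

Under these assumptions, roughly 34,000 turns were changed during the revision phase.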

Gender and Code-Switching

Alexandria features metadata for city-level varieties and for four speaker-addressee gender directions: female-to-male, male-to-female, male-to-male, and female-to-female. For stylistic accuracy, registers were adapted to the scenario and persona, enabling context-aware evaluation that reflects natural code-switching and social dynamics.

Evaluation Setup

The paper evaluates 24 Arabic-capable LLMs under turn-level, context-level, and conversation-level prompting with metadata-aware inputs, and reports spBLEU and chrF++ for automatic assessment.
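The three prompting granularities can be pictured with a small prompt builder. This is a minimal sketch under stated assumptions: the interpretation of each level (current turn only, current turn plus preceding turns, whole conversation at once), the function name build_prompt, the metadata keys, and the prompt wording are all illustrative and are not the paper's actual templates.

```python
from typing import Dict, List


def build_prompt(turns: List[str], index: int, level: str, metadata: Dict[str, str]) -> str:
    """Assemble a metadata-aware translation prompt (hypothetical format)."""
    header = ", ".join(f"{key}: {value}" for key, value in metadata.items())
    if level == "turn":
        # Turn-level: the model sees only the turn to translate.
        body = f"Translate this turn:\n{turns[index]}"
    elif level == "context":
        # Context-level: preceding turns are supplied as context.
        history = "\n".join(turns[:index])
        body = f"Previous turns:\n{history}\n\nTranslate this turn:\n{turns[index]}"
    elif level == "conversation":
        # Conversation-level: the whole dialogue is translated in one request.
        body = "Translate every turn of this conversation:\n" + "\n".join(turns)
    else:
        raise ValueError(f"unknown level: {level!r}")
    return f"[{header}]\n{body}"


# Hypothetical metadata keys and values, for illustration only.
meta = {"target dialect": "<city-level dialect>", "speaker": "female", "addressee": "male"}
dialogue = ["Good morning, how can I help you?", "I would like to open an account."]
print(build_prompt(dialogue, 1, "context", meta))
```

The same function serves either translation direction, since the turns are passed in as plain strings in whichever language is the source.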

Main Benchmark Findings

Results show a strong directional asymmetry: dialect-to-English is easier than English-to-dialect. Human evaluation reports high gender adherence, while dialectness remains a major challenge for many systems.

Dataset Stats

Coverage and scale

107,631

Total human-translated English-to-dialect turns

13

Arab countries

11

Domains

4

Regional groups: Levant, Gulf, Nile, and Maghreb


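Number of turns per domain and country. Region key: Levant (JO, LB, PS, SY), Gulf (SA, OM, YE), Nile (EG, SD), Maghreb (LY, MA, MR, TN).

| Domain | JO | LB | PS | SY | SA | OM | YE | EG | SD | LY | MA | MR | TN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agriculture/Farming | 825 | 1,140 | 1,770 | 931 | 1,162 | 915 | 529 | 583 | 163 | 231 | 570 | 970 | 481 |
| Commerce/Transactions | 750 | 1,004 | 1,595 | 749 | 1,020 | 650 | 579 | 506 | 201 | 160 | 445 | 757 | 401 |
| Construction/Real Estate | 859 | 995 | 1,761 | 861 | 1,161 | 974 | 696 | 660 | 225 | 271 | 574 | 673 | 485 |
| Education/Academia | 816 | 1,191 | 1,513 | 831 | 1,017 | 1,079 | 563 | 549 | 170 | 220 | 601 | 863 | 551 |
| Energy/Resources | 786 | 1,048 | 1,715 | 928 | 1,177 | 937 | 587 | 625 | 189 | 243 | 447 | 719 | 470 |
| Everyday/Social | 967 | 1,215 | 1,697 | 787 | 1,020 | 888 | 642 | 604 | 175 | 210 | 595 | 824 | 550 |
| Healthcare/Medical | 727 | 1,240 | 1,728 | 781 | 1,043 | 895 | 548 | 487 | 164 | 253 | 556 | 948 | 522 |
| Legal/Financial | 693 | 1,006 | 1,566 | 757 | 857 | 753 | 496 | 539 | 177 | 174 | 481 | 642 | 412 |
| Logistics/Transport | 842 | 1,020 | 1,512 | 950 | 1,234 | 842 | 629 | 646 | 189 | 187 | 593 | 877 | 515 |
| Professional/Workplace | 845 | 1,220 | 1,810 | 959 | 1,112 | 866 | 549 | 645 | 178 | 253 | 480 | 709 | 526 |
| Tourism/Hospitality | 720 | 1,161 | 1,596 | 884 | 1,004 | 815 | 608 | 608 | 190 | 216 | 567 | 878 | 460 |
| Total | 8,830 | 12,240 | 18,263 | 9,418 | 11,807 | 9,614 | 6,426 | 6,452 | 2,021 | 2,418 | 5,909 | 8,860 | 5,373 |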
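The per-country totals in the Total row can be cross-checked against the headline figure of 107,631 turns. The snippet below is a small sanity check, not part of the project's code; the assignment of country codes to the four regional groups is assumed from the order of the table header.

```python
# Per-country totals copied from the Total row of the table.
# The region grouping is an assumption based on the header order.
REGIONS = {
    "Levant": {"JO": 8830, "LB": 12240, "PS": 18263, "SY": 9418},
    "Gulf": {"SA": 11807, "OM": 9614, "YE": 6426},
    "Nile": {"EG": 6452, "SD": 2021},
    "Maghreb": {"LY": 2418, "MA": 5909, "MR": 8860, "TN": 5373},
}


def regional_totals(regions):
    """Sum the per-country turn counts within each regional group."""
    return {region: sum(counts.values()) for region, counts in regions.items()}


totals = regional_totals(REGIONS)
grand_total = sum(totals.values())

for region, count in totals.items():
    print(f"{region}: {count:,} turns ({count / grand_total:.1%})")
print(f"All regions: {grand_total:,} turns")
```

The grand total computed this way matches the 107,631 turns reported for the dataset, and the Levant group accounts for the largest share.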

Qualitative Examples

Samples from Alexandria Dataset

Resources

Alexandria Resources

Available now

Paper

Read the full Alexandria research paper on arXiv.

Available now

Data

Browse the Alexandria dataset release on Hugging Face, including splits and hosted dataset files.

Available now

Code

Access the Alexandria project repository on GitHub for evaluation code, updates, and related project materials.


Authors

Alexandria team

Citation

How to cite Alexandria

If you use Alexandria or the associated methodologies, please cite the original paper using the following BibTeX entry.

@misc{mekki2026alexandriamultidomaindialectalarabic,
  title={Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs},
  author={Abdellah El Mekki and Samar M. Magdy and Houdaifa Atou and Ruwa AbuHweidi and Baraah Qawasmeh and Omer Nacar and Thikra Al-hibiri and Razan Saadie and Hamzah Alsayadi and Nadia Ghezaiel Hammouda and Alshima Alkhazimi and Aya Hamod and Al-Yas Al-Ghafri and Wesam El-Sayed and Asila Al sharji and Mohamad Ballout and Anas Belfathi and Karim Ghaddar and Serry Sibaee and Alaa Aoun and Areej Asiri and Lina Abureesh and Ahlam Bashiti and Majdal Yousef and Abdulaziz Hafiz and Yehdih Mohamed and Emira Hamedtou and Brakehe Brahim and Rahaf Alhamouri and Youssef Nafea and Aya El Aatar and Walid Al-Dhabyani and Emhemed Hamed and Sara Shatnawi and Fakhraddin Alwajih and Khalid Elkhidir and Ashwag Alasmari and Abdurrahman Gerrio and Omar Alshahri and AbdelRahim A. Elmadany and Ismail Berrada and Amir Azad Adli Alkathiri and Fadi A Zaraket and Mustafa Jarrar and Yahya Mohamed El Hadj and Hassan Alhuzali and Muhammad Abdul-Mageed},
  year={2026},
  eprint={2601.13099},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.13099},
}