Zaid Alyafeai

About Me

I am currently a postdoc at KAUST working under Prof. Bernard Ghanem. I defended my PhD thesis at KFUPM in January 2024. I am the co-founder of arbml, an initiative to support Arabic NLP research and tools. I am also a founding member of fihmai, which targets publishing resources that enrich AI content in Arabic. I was part of bigscience, where I helped in multiple working groups including Tokenization, Data Sourcing, and Prompting. I am currently a member of C4AI, an open research environment where I co-lead the Arabic effort. If you want to connect with me, I mostly hang out in our arbml Discord, or you can drop me a message by email.

Research Interests

My current research interest is general-domain metadata extraction, with a specific focus on scientific papers. I also work extensively on Arabic NLP, specifically on training culturally sensitive and safety-aware language models.

Latest News

[29/05/2026] Global PIQA: Evaluating Commonsense Reasoning Across 100+ Languages and Cultures published on arXiv.
[18/05/2026] CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models published on arXiv.
[03/04/2026] Joined the ARR communication team.
[04/03/2026] Part of the organizing committee of ArabicNLP 2026. Scholarship and Awards co-chair.
[21/02/2026] PromptLab: A Collaborative Platform for Prompt Engineering and Dataset Curation accepted at EACL Demo 2026.
[08/11/2025] MeXtract: Light-Weight Metadata Extraction from Scientific Papers published on arXiv.
[01/10/2025] Member of the scientific committee of the KAUST Rising Stars in AI Symposium 2026.
[03/09/2025] BALSAM: A Platform for Benchmarking Arabic Large Language Models accepted at ArabicNLP 2025.
[20/08/2025] MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs accepted to EMNLP Findings 2025.
[11/09/2024] Part of the organizing committee of ArabicNLP 2025. Scholarship and Awards co-chair.
[18/03/2024] Started a research internship at Stability AI.
[16/03/2024] Three papers accepted at ACL 2024: CIDAR, ArabicMMLU, and Aya Dataset.
[20/02/2024] ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic published on arXiv.
[13/02/2024] Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning published on arXiv.
[06/02/2024] CIDAR: Culturally Relevant Instruction Dataset For Arabic published on arXiv.
[01/01/2024] Successfully defended my PhD thesis.
[07/12/2023] Attended the Arabic NLP conference at EMNLP 2023.
[31/10/2023] Gave a presentation at KSGAAL about Arabic poetry generation and analysis. Check the slides on Google Slides.
[28/10/2023] Reached 2,000+ citations on Google Scholar.
[12/10/2023] Investigating Zero-shot Cross-lingual Language Understanding for Arabic accepted at ArabicNLP co-located with EMNLP 2023.
[10/09/2023] Program committee member of ArabicNLP, co-located with EMNLP 2023.
[12/07/2023] Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches published on arXiv.
[29/06/2023] Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models published on arXiv.
[06/08/2023] Program committee member of NLP-OSS, co-located with EMNLP 2023.
[31/05/2023] Invited talk titled “Teach Me Once I Learn Much More: Fine Tuned LLMs are Zeroshot Task Generalizers” at JCRAI KFUPM. Check slides.
[30/05/2023] Reached 1,000 citations on Google Scholar.
[27/05/2023] Crosslingual Generalization through Multitask Finetuning has been accepted as a poster presentation at ACL 2023.
[27/10/2022] Placed second in the AI in Sports challenge organized by the Ministry of Sports.
[19/09/2022] The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset has been accepted at the Datasets and Benchmarks track of NeurIPS 2022.
[12/05/2022] Invited talk titled “Masader: Documenting Arabic NLP Data Resources” at IWABigDAI. Check slides.
[07/05/2022] PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts has been accepted in the demo track of ACL 2022.
[18/04/2022] Masader: Metadata Sourcing for Arabic Text and Speech Data Resources has been accepted at LREC 2022.
[20/01/2022] Multitask Prompted Training Enables Zero-Shot Task Generalization has been accepted as a spotlight at ICLR 2022.
[22/02/2021] Arabic Compact Language Modelling for Resource Limited Devices accepted at WANLP 2021.
[26/09/2020] ARBML: Democratizing Arabic Natural Language Processing Tools accepted at NLP-OSS 2020.
[19/09/2020] Passed the PhD comprehensive exam.