DiaCorpus - Creation of Dialectic Arabic Corpora

Part of the National Natural Language Processing Program of Israel

Goal


Development of large, comprehensive, annotated corpora in the Palestinian Arabic dialect. The corpora are to be made broadly available and easily accessible to both academia and industry for NLP application development.

Method


Recorded speech and textual data sources of dialectic Arabic will be used in the creation of the corpora. Data will be collected from a variety of publicly available datasets and will also be generated at the Data Science Institute. The data will be cleaned and tagged for a variety of NLP tasks, such as sentiment analysis, named entity recognition (NER), coreference resolution, emotion recognition, and summarization. It will then be placed into broadly available, properly annotated corpora that can be easily accessed for NLP application development.
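To make the clean-and-tag step concrete, the following Python sketch shows one possible shape for a single annotated corpus record. It is only an illustration under stated assumptions: the class names, field names, tag labels, the `clean` and `validate` helpers, and the JSON Lines output are all hypothetical and do not describe the project's actual schema or tooling. The sentence is a made-up example, not project data.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class EntitySpan:
    """A named-entity annotation over character offsets [start, end)."""
    start: int
    end: int
    label: str  # hypothetical tag set, e.g. "PER", "LOC", "ORG"


@dataclass
class CorpusRecord:
    """One annotated utterance. Every field name here is an assumption."""
    text: str
    source: str                       # e.g. "public_dataset" or "generated"
    sentiment: Optional[str] = None   # e.g. "positive", "negative", "neutral"
    entities: List[EntitySpan] = field(default_factory=list)


def clean(text: str) -> str:
    """Collapse runs of whitespace; a stand-in for a real cleaning step."""
    return " ".join(text.split())


def validate(record: CorpusRecord) -> None:
    """Reject entity spans that are empty or fall outside the text."""
    for span in record.entities:
        if not (0 <= span.start < span.end <= len(record.text)):
            raise ValueError(f"bad span {span} for text of length {len(record.text)}")


def to_jsonl_line(record: CorpusRecord) -> str:
    """Serialize a validated record as one JSON Lines row, keeping Arabic readable."""
    validate(record)
    return json.dumps(asdict(record), ensure_ascii=False)


# Usage: clean a raw sentence, tag a place name, and serialize the record.
place = "حيفا"
text = clean("  مرحبا   من  حيفا ")
start = text.index(place)  # compute offsets instead of hard-coding them
record = CorpusRecord(
    text=text,
    source="example",
    sentiment="positive",
    entities=[EntitySpan(start, start + len(place), "LOC")],
)
print(to_jsonl_line(record))
```

Storing entity annotations as character offsets next to the untouched text, rather than inserting tags into the text itself, lets several annotation layers (sentiment, NER, coreference) share one record without interfering with each other.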

Technological advances in AI and machine learning, and specifically developments in Natural Language Processing (NLP), have brought about a flurry of academic research and industry initiatives, many of which have matured into products that play a significant role in the everyday lives of people worldwide. Common applications include digital personal assistants, advanced search engines, and automatic translation services. However, because most NLP tools are developed for English, NLP applications in other languages usually lag behind in robustness, accuracy, and efficiency.

 

In Israel, the two national languages, Hebrew and Arabic, have also been affected by this disparity. Although some initial developments for Hebrew have recently started to emerge, NLP tools for both languages are still far behind their English counterparts.

 

The main challenge for NLP development in Hebrew and Palestinian dialectic Arabic is that both are Semitic languages, which are more difficult to analyze. As a result, the quality of language understanding and recognition in Hebrew and Arabic is lower, which is a barrier to advanced, high-quality services. In addition, the large-scale Hebrew and dialectic Arabic datasets needed to develop and train NLP models are scarce.

 

The National Natural Language Processing Program of Israel (NNLP-IL) is a national initiative to create infrastructure and to research and develop advanced capabilities that advance the field of NLP in Hebrew and dialectic Arabic. Once implemented, this infrastructure will facilitate the development of a variety of NLP applications, such as chatbots, prediction models, and language models.

 

Guiding Principles:

  • Generic - the generic framework will allow fitting and customizing solutions for a variety of applications (without focusing on specific use cases).
  • Open source - everyone can take part, contribute, and use.
  • Break through the data barrier - create tagged and untagged datasets and make them accessible to the general public.
  • Usability - distribute capabilities through user manuals, code repositories, FAQs, and more.

 

DSI as part of the national project:

The DSI has been selected to develop one of the modules in this national program: the creation of dialectic Arabic corpora and datasets.

 

 

Resources:

For more information on the National Program for Natural Language Processing: https://www.nationalplanil.ai/

GitHub project: https://github.com/NNLP-IL/NNLP-IL