Textual Data Corpora

This project, initiated and supported by MAFAT (Defense Ministry Administration for Technological Infrastructure), aims to create a dialectal Arabic corpus from textual sources.

Data Sources: A wide variety of written texts in dialectal Arabic, including social media, news platforms, and academic sources.

The two main efforts of this project are:

Textual Data Mapping: This phase of the project focuses on identifying publicly available dialectal Arabic datasets and mapping their characteristics. For example, for an existing corpus tagged by sentiment, we record characteristics such as the source of the corpus, the platform, the dialect, the size of the corpus, and the year it was created. This mapping effort will provide an inventory of available corpora that can be used in the later phases of the project.
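
As a rough illustration of what one inventory entry from the mapping phase might look like, here is a minimal Python sketch. It is not the project's actual schema: the class name CorpusRecord, the helper find_corpora, the field names, and every example value are assumptions made for illustration only, loosely following the characteristics listed above (source, platform, dialect, size, year).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CorpusRecord:
    """One inventory row; all field names are illustrative assumptions."""
    name: str
    annotation: str  # what the corpus is tagged for, e.g. "sentiment"
    source: str      # who published the corpus
    platform: str    # where the text was collected
    dialect: str
    size: int        # number of items (the unit is an assumption)
    year: int        # year the corpus was created


def find_corpora(
    inventory: List[CorpusRecord],
    dialect: Optional[str] = None,
    annotation: Optional[str] = None,
) -> List[CorpusRecord]:
    """Return the records that match every filter that is not None."""
    return [
        record for record in inventory
        if (dialect is None or record.dialect == dialect)
        and (annotation is None or record.annotation == annotation)
    ]


# Placeholder entries for demonstration; these are not real datasets.
EXAMPLE_INVENTORY = [
    CorpusRecord("example-a", "sentiment", "example lab",
                 "social media", "Palestinian", 1000, 2020),
    CorpusRecord("example-b", "summarization", "example lab",
                 "news", "Egyptian", 500, 2019),
]
```

With such a structure, a later phase could, for instance, call find_corpora(EXAMPLE_INVENTORY, dialect="Palestinian") to list candidate corpora for a given dialect.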

Textual Corpus Creation: This phase of the project focuses on identifying the most relevant sources for spoken Israeli Palestinian dialectal Arabic. Once these sources are identified, three corpora are created for three NLP tools: sentiment analysis, summarization, and coreference resolution.

Resources:
MAFAT: https://english.mod.gov.il/Pages/default.aspx

    • Abraham Israeli

    • Dr. Yossi Mann

    • Dr. Kfir Bar

    • Maya Stemmer

    • Dr. Shai Fine

    • Aviv Naaman

    • Matan Sheskin

    • Moran Daxa

    • Sali Arieli

    • Rotem Ezekiel

    • Shahar Nissim

    • Itai Mohaban