Spoken Arabic

In this project, initiated and supported by the Israel Innovation Authority, a corpus of dialectal Arabic is being developed from recorded speech.

Data Sources: The corpus is being developed from publicly available audio recordings, such as interviews and podcasts, and from over 30 hours of newly recorded podcasts produced at the Data Science Institute in collaboration with Reichman University's radio station, AudioVersity. The newly recorded speech data comes from Arab Israeli speakers. To cover the full range of dialectal nuances, the content spans a wide range of topics (e.g. technology, education, innovation), and participants come from varied demographic backgrounds and speak a variety of local sub-dialects (e.g. Northern Israel, Southern Israel, East Jerusalem).

 


 


Data Transcription: The full body of video and audio recordings will be transcribed in two stages to obtain reliable textual content in dialectal Arabic:

  • Automatic speech-to-text technology will be used for the initial transcription.
  • Human transcribers will then manually correct and refine the automatic output.
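
One way to gauge how much manual correction the automatic draft needs is to compute the word error rate (WER) between the speech-to-text output and the human-corrected transcript. The sketch below is illustrative only: the function name and the sample strings are assumptions, and the text does not say which metric, if any, the project uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the human-corrected transcript and `hypothesis` is the
    automatic draft. Both names are hypothetical, chosen for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the draft
            insertion = curr[j - 1] + 1  # extra word in the draft
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for real transcript words.
corrected = "w1 w2 w3 w4"
draft = "w1 wX w3"
print(word_error_rate(corrected, draft))
```

A WER of zero means the automatic draft needed no changes. Higher values indicate recordings, speakers or sub-dialects where the automatic stage is weaker and transcribers will spend more effort.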

 


Data Annotation:

  • Following transcription, each text segment is annotated for sentiment, emotion and named entities (Named Entity Recognition, NER).
  • The annotation is performed by a large team of students and is supported by technology provided to the DSI by Label Studio, an open-source labeling platform.
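
To make the three annotation layers concrete, the following sketch models a single annotated segment as a small data structure with basic consistency checks. The label sets, class names and field names are hypothetical, since the text does not specify the project's tag inventories, and this is not Label Studio's export format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical label sets, assumed for illustration only.
SENTIMENTS = {"positive", "negative", "neutral"}
EMOTIONS = {"joy", "anger", "sadness", "fear", "surprise", "none"}
ENTITY_TYPES = {"PER", "LOC", "ORG"}


@dataclass
class EntitySpan:
    start: int  # character offset into the segment text, inclusive
    end: int    # character offset into the segment text, exclusive
    label: str  # one of ENTITY_TYPES


@dataclass
class AnnotatedSegment:
    text: str
    sentiment: str
    emotion: str
    entities: List[EntitySpan] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if self.sentiment not in SENTIMENTS:
            problems.append(f"unknown sentiment: {self.sentiment!r}")
        if self.emotion not in EMOTIONS:
            problems.append(f"unknown emotion: {self.emotion!r}")
        for span in self.entities:
            if span.label not in ENTITY_TYPES:
                problems.append(f"unknown entity type: {span.label!r}")
            if not (0 <= span.start < span.end <= len(self.text)):
                problems.append(f"span out of range: {span.start}-{span.end}")
        return problems


segment = AnnotatedSegment(
    text="Sami visited Haifa",
    sentiment="neutral",
    emotion="none",
    entities=[EntitySpan(0, 4, "PER"), EntitySpan(13, 18, "LOC")],
)
print(segment.validate())
```

Checks of this kind are useful when many annotators contribute, because they catch malformed records before the data reaches model training.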


Baseline models: For each annotation task, we will produce a baseline model that serves as a reference point for NLP model training.
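
As an illustration of what a reference baseline can look like for a classification task such as sentiment, the sketch below implements a majority-class baseline. This is an assumption made for the example: the text does not say which baseline models the project will provide, and the labels shown are made up.

```python
from collections import Counter
from typing import Sequence


def majority_label(train_labels: Sequence[str]) -> str:
    """Return the most frequent label in the training set."""
    if not train_labels:
        raise ValueError("no training labels given")
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the prediction matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Made-up labels, for illustration only.
train = ["neutral", "positive", "neutral", "negative"]
test_gold = ["neutral", "positive", "neutral", "neutral"]

label = majority_label(train)
predictions = [label] * len(test_gold)  # always predict the majority class
print(label, accuracy(test_gold, predictions))
```

Any trained model should beat this score on the same test split. If it does not, the model has learned little beyond the label distribution.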

 

The corpus will be made widely available under an open-source license.

 

Resources:

 

Label Studio: https://labelstud.io/

Israel Innovation Authority: https://innovationisrael.org.il/

National Natural Language Processing Plan of Israel: https://www.nationalplanil.ai/

 

Dataset Website: https://idc-dsi.github.io/DiaCorpus/

 

 

 

    • Abraham Israeli

    • Dr. Yossi Mann

    • Dr. Kfir Bar

    • Raawa Makhoul

    • Julian Jubran

    • Dr. Shai Fine

    • Aviv Naaman

    • Matan Sheskin

    • Dana Qaraeen

    • Eden Bokobza

    • Guy Maduel

    • Matan Eblagon