GSMA and Pleias Launch Open-Source Model to Correctly Identify 61 African Languages in AI Systems

The GSMA and French AI research company Pleias have released CommonLingua, an open-source language identification model covering 334 languages including 61 African languages, addressing a foundational gap in AI systems that has caused African-language text to be routinely misidentified.

Released under the GSMA’s African AI Languages Model Project, the two-million-parameter model achieves 83% accuracy in identifying African languages, a significant improvement over existing systems. Leading language identification tools such as fastText, GlotLID and OpenLID were built primarily around European and Asian languages, and African-language text is frequently mislabeled as English or French. Even state-of-the-art AI models lose roughly 30 percentage points in accuracy on African languages compared to major world languages, according to the GSMA.

CommonLingua covers 61 African languages across seven groupings: Bantu with 21 languages, Niger-Congo and West African with 18, Afro-Asiatic and Semitic with 7, Cushitic and Chadic with 4, Berber with 3, Nilo-Saharan with 3, and pidgins, creoles and other languages with 5. The model operates directly on UTF-8 byte sequences rather than relying on language-specific tokenizers, enabling consistent handling across scripts including Latin, Arabic, Ethiopic, N’Ko and Tifinagh.
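
To illustrate what operating on UTF-8 bytes means in practice, the short Python sketch below converts sample strings in several scripts into byte values. It is not CommonLingua's code: the function name `utf8_byte_ids` and the sample words are assumptions chosen for illustration. The point is only that every script reduces to integers in the same 0 to 255 range, so no per-language tokenizer or vocabulary is needed.

```python
# Illustrative sketch only; not taken from the CommonLingua codebase.
# utf8_byte_ids and the sample strings are hypothetical.

def utf8_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of integers in 0..255."""
    return list(text.encode("utf-8"))

# One short sample per script mentioned in the article.
samples = {
    "Latin": "Habari",
    "Arabic": "سلام",
    "Ethiopic": "ሰላም",
    "N'Ko": "ߒߞߏ",
    "Tifinagh": "ⵜⴰⵎⴰⵣⵉⵖⵜ",
}

for script, text in samples.items():
    ids = utf8_byte_ids(text)
    # Whatever the script, every value fits the same 256-symbol alphabet.
    assert all(0 <= b <= 255 for b in ids)
    print(f"{script:9} chars={len(text):2} bytes={len(ids):2} first={ids[:4]}")
```

Characters outside the basic Latin range take two or three bytes each, so byte sequences run longer than character counts for most of these scripts. A byte-level model accepts that extra length in exchange for a fixed, script-independent input alphabet.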

Pierre-Carl Langlais, co-founder and chief technology officer at Pleias, said the model addresses the first essential step in building AI infrastructure for African languages. “African languages are not an edge case. They are the working languages of hundreds of millions of people, and they deserve AI infrastructure built with the same care as any other language,” he said. “CommonLingua is deliberately the first brick we are laying: you cannot curate what you cannot identify.”

Louis Powell, director of AI initiatives at the GSMA, said the lack of foundational infrastructure has long held back progress on African-language AI. “CommonLingua addresses this critical gap, enabling the development of richer datasets and more representative AI systems at scale,” he said.

Africa is home to between 2,000 and 3,000 distinct languages. Nigeria alone has more than 500, and South Africa has 11 official languages, yet only one in 10 South Africans speaks English at home, even though English dominates the internet and most AI training data.

The model was trained exclusively on open-licensed and public domain content aggregated through the Common Corpus project, drawing on sources including Wikipedia, scientific publications in OpenAlex, VOA Africa, WaxalNLP, cultural heritage archives and Pralekha. All datasets are released under permissive licenses.
