Project Information
- Stakeholder: Human Language Technology (HLT) Research Group, CSIR
- Students: Juwaki Ledwaba (UL), Thabo Mahlangu (TUT), Chris Zitha (UL)
- Project Lead: Avashlin Moodley
- Project Mentors: Anathi Mafuna, Aby Louw, Karen Calteaux, Febe de Wet, Georg Schlunz, Carmen Moors
- Year: 2016/2017
Project Description
Language identification (LID) is a machine learning task that involves assigning a piece of text to the most probable language class in the LID model. LID is considered a solved problem for long pieces of text, but short texts remain a challenge. The project involved scraping data from various websites using the import.io scraping tool. Project LID's main objective was to identify the languages used in sports, i.e. to determine which language is used the most in discussions of a given sport, such as soccer.
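To make the classification task concrete, here is a minimal sketch of a character n-gram language identifier in Python. It is not the HLT model used in the project. The toy training sentences, the function names (`char_ngrams`, `train`, `identify`) and the simple trigram frequency scoring are all assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples):
    """Build one relative-frequency n-gram profile per language.

    `samples` maps a language label to a list of example sentences
    (hypothetical toy data, not the project's corpus).
    """
    profiles = {}
    for language, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(char_ngrams(sentence))
        total = sum(counts.values())
        profiles[language] = {gram: c / total for gram, c in counts.items()}
    return profiles


def identify(text, profiles):
    """Return the language whose profile gives the text the highest score."""
    grams = char_ngrams(text)
    scores = {
        language: sum(profile.get(gram, 0.0) for gram in grams)
        for language, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Toy training data, for illustration only.
SAMPLES = {
    "English": ["the team played very well today", "what a great goal that was"],
    "Afrikaans": ["die span het vandag baie goed gespeel", "wat 'n pragtige doel was dit"],
}

profiles = train(SAMPLES)
print(identify("the goal was great", profiles))
```

A short comment yields only a handful of n-grams to score, which is one way to see why short texts are harder to classify than long ones.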
In the first phase of the project, the Human Language Technology (HLT) research group gave the team an existing LID model to apply to data in order to uncover interesting facts. The team collected data from the comment sections of sports websites such as Soccer Laduma and ESPN, using a tool called import.io. The first phase produced some interesting results: Sepedi was used more often than any other language in comments about rugby, which the team took to mean that Sepedi speakers enjoy rugby. isiNdebele was the least used language across all sport types, which the team took to suggest that isiNdebele speakers are less interested in sport.
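The per-sport comparison described here amounts to counting predicted language labels for each sport. The following is a minimal sketch, assuming every comment has already been labelled by a LID model. The records are invented placeholders, not the project's data or results, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def language_counts_by_sport(records):
    """Count how often each language label occurs for each sport.

    `records` is an iterable of (sport, language) pairs, one per comment.
    """
    counts = defaultdict(Counter)
    for sport, language in records:
        counts[sport][language] += 1
    return counts


def most_used_language(counts, sport):
    """Return the most frequent language label for the given sport."""
    return counts[sport].most_common(1)[0][0]


# Invented placeholder records, for illustration only.
RECORDS = [
    ("soccer", "isiZulu"),
    ("soccer", "isiZulu"),
    ("soccer", "English"),
    ("rugby", "Sepedi"),
    ("rugby", "Sepedi"),
    ("rugby", "Afrikaans"),
]

counts = language_counts_by_sport(RECORDS)
print(most_used_language(counts, "rugby"))
```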
During the second phase, the team discovered that the LID model used in the first phase was not very accurate, so they had to create their own, more accurate model. They built it with the Conditional Random Fields (CRF) suite, a software package for creating models, which they used to train and test the model. New data was collected from the same sources as in the first phase. A Named Entity Recognition (NER) model was then used to identify entities such as proper nouns, locations and times in a given piece of text. The NER model helped to increase the accuracy of the LID model because it was able to pick up these entities. To present the findings on how the languages were used, the team created new visualizations, for instance a chart comparing the two models, which showed the trend of how each language is used per sport.
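CRF toolkits are typically trained on a set of features for every token in a sequence. The source does not describe the features the team used, so the sketch below is only an assumption of what such input might look like: the feature names, the `is_entity` flag (standing in for the output of a NER model) and the example sentence are all hypothetical.

```python
def token_features(tokens, index, entity_flags):
    """Build a feature dictionary for the token at `index`.

    `entity_flags[index]` is True when a NER model (assumed to have been
    run beforehand) marked the token as part of a named entity.
    """
    token = tokens[index]
    features = {
        "lower": token.lower(),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "is_capitalized": token[:1].isupper(),
        "is_entity": entity_flags[index],
    }
    if index > 0:
        features["prev_lower"] = tokens[index - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if index < len(tokens) - 1:
        features["next_lower"] = tokens[index + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features


def sentence_features(tokens, entity_flags):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i, entity_flags) for i in range(len(tokens))]


# Hypothetical example: the team name is flagged as a named entity.
tokens = ["Chiefs", "played", "well", "today"]
entity_flags = [True, False, False, False]
features = sentence_features(tokens, entity_flags)
print(features[0])
```

Flagging named entities separately is one plausible way for NER output to help a LID model, since names of players and clubs appear unchanged in comments written in any language.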
The team faced challenges when collecting data. Some sports websites did not have comment sections, so the desired data could not be obtained from them, and other websites did not allow scraping. The team therefore had to choose websites that permitted scraping.
Student Remarks
The students learned how to build and use LID and NER models, and gained software skills in the process. Finally, they learned that LID still needs to be improved, especially in a multilingual country such as South Africa.
Author: Team + Nolihle Gulwa, B Tech Journalism, Walter Sisulu University.