The Challenge in Mapping Multilingual Media Coverage on Migration and Freedom of Movement in Europe
October 16, 2019
By Fabienne Lind and Jakob-Moritz Eberl
In recent years, the number of texts that can be retrieved from the Internet for research purposes has massively increased through the use of (online) newspaper databases. To manage the methodological challenges that arise from this overload of textual materials, computer scientists and communication scholars have developed manifold strategies to analyze massive amounts of written text. What all these strategies have in common is that they rely on the power of computer algorithms and apply precisely tailored tools and procedures to automate text analysis. Using computer-assisted methods of text analysis, scholars have, for example, been able to analyze millions of digitized books, providing insights into the evolution of English grammar and retracing collective memory between 1800 and 2000. In another instance, such methods allowed researchers to analyze massive numbers of Twitter messages to correctly predict election outcomes.
However, while the existing techniques are well established for English-language text, the situation is different when it comes to the study of text in more than one language and in languages other than English. Solutions to this difficult endeavour – or at least improved strategies – are thus of high interest to the research community. A core objective of REMINDER Work Package 8 has been to contribute to this quest.
Our research: Mapping European media discourse on intra-EU mobility
The main goal of Work Package 8 has been to map discourses on migration and intra-EU mobility across EU countries from 2000 until 2017. The project’s focus is on discourses in mass media, given the crucial role these discourses play as links between politics and citizens. A comprehensive literature review on migration media coverage (by Work Package 8) and media effects (by REMINDER Work Package 9) can be found here.
Work Package 8 examines around 1.5 million news articles on migration and freedom of movement. The news articles were published in the main print and online newspapers in Spain, the UK, Germany, Sweden, Poland, Hungary and Romania.
The challenge: Text in 7 different languages
Our multilingual text corpora thus originate from Spanish, English, German, Swedish, Polish, Hungarian, and Romanian text sources. Not only are these all very distinct languages, they actually belong to different language families and subfamilies and thus follow rather distinctive principles and rules. They differ, for example, as regards the richness of the language (e.g., the number of synonyms) or the structure of words (e.g., word stems, compound words). These are just a few examples to illustrate the challenges involved in analysing text in different languages. Strategies for multilingual text analysis that rely on the power of computer assistance have to take these special linguistic aspects into account.
Strategies for large-scale text analysis
One of the methods used to analyse large amounts of text, whether monolingual or multilingual, is the so-called dictionary approach. Here, a dictionary is a list of words or phrases, referred to as ‘features’. The analysis strategy assumes that concepts (i.e., whatever one wants to measure within the corpus) are reflected in these dictionary features. If enough of these features are found in a text, this text can then be classified as belonging to the specific concept.
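The dictionary logic described above can be sketched in a few lines of Python. This is only a minimal illustration: the feature list, the threshold of two hits, and the function names are assumptions made for this example and are not taken from the Work Package 8 dictionaries. Matching a feature plus any trailing word characters is a crude stand-in for proper pre-processing such as stemming.

```python
import re

# Hypothetical feature list, for illustration only.
MIGRATION_FEATURES = ["migrant", "migration", "asylum", "refugee", "freedom of movement"]


def count_feature_hits(text, features):
    """Count occurrences of dictionary features in a text (case-insensitive)."""
    lowered = text.lower()
    total = 0
    for feature in features:
        # \b anchors the start of the feature; \w* tolerates endings such as a plural 's'.
        pattern = r"\b" + re.escape(feature) + r"\w*"
        total += len(re.findall(pattern, lowered))
    return total


def classify(text, features, threshold=2):
    """Assign the text to the concept if enough features are found (threshold is an assumption)."""
    return count_feature_hits(text, features) >= threshold
```

For example, `classify("Migrants and refugees crossed the border.", MIGRATION_FEATURES)` finds two features and returns `True`, while a text without any feature is not assigned to the concept. In practice, the threshold is itself something that would be tuned during validation.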
To construct a dictionary, several steps are performed:
- pre-selection of dictionary features,
- pre-processing of these features,
- validation, a quality test of the dictionaries (i.e., the classification decisions of the dictionary are compared with the classification decisions of humans).
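The validation step can be illustrated with a short sketch that compares dictionary decisions with human decisions for the same articles. The choice of precision, recall, and F1 as quality measures is an assumption here (they are common for this kind of comparison, but the text above does not name specific metrics), and the function name is hypothetical.

```python
def validation_scores(dictionary_labels, human_labels):
    """Compare dictionary decisions with human decisions (True = article belongs to the concept)."""
    pairs = list(zip(dictionary_labels, human_labels))
    true_pos = sum(1 for d, h in pairs if d and h)
    false_pos = sum(1 for d, h in pairs if d and not h)
    false_neg = sum(1 for d, h in pairs if not d and h)

    # Precision: share of dictionary hits that the human coders confirm.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: share of human-coded hits that the dictionary also finds.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A dictionary that flags many articles the human coders reject has low precision; one that misses many articles the coders accept has low recall. Both kinds of error matter when a dictionary is used to count articles across countries.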
The multilingual case: The construction of dictionaries and their application
A multilingual dictionary is a dictionary that includes feature lists in different languages, all for the measurement of one single concept. However, the construction of such a multilingual dictionary follows the steps outlined above only to a certain extent.
The main differences in the case of multilingual dictionary construction concern the following additional steps or alterations of the steps outlined above:
- machine translation of features (for a humorous example of challenges that may arise when using machine translation software, see here),
- during the pre-processing step, one has to consider language particularities (e.g., grammatical rules, richness of language, structure of words),
- validation of translated features in collaboration with native speakers.
The research group of Work Package 8 tested how different dictionary construction strategies (i.e., pre-selection of dictionary features, pre-processing of these features) contribute to the construction of high-quality multilingual dictionaries.
After that, we investigated procedures of dictionary application. In fact, there are (at least) two approaches to applying dictionaries when it comes to the analysis of a multilingual text corpus.
Approach A: the application of a multilingual dictionary to a multilingual text corpus.
Approach B: the translation of the multilingual text corpus into a target language (mostly English) and the application of a dictionary in this language (mostly English).
We tested and compared how both approaches work, and concluded that, if researchers lack the language skills or resources for extensive refinement of multilingual dictionaries (e.g., additional efforts to account for country discourse-specific keywords), it is preferable to machine-translate the corpus and apply English-language instruments (Approach B). The findings are summarized in a paper that can be accessed here.
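The difference between the two approaches can be made concrete with a small sketch. Everything in it is hypothetical: the toy feature lists, the function names, and above all the translation step, which is passed in as a plain function because no specific machine translation service is assumed. A real pipeline would call an actual translation system at that point.

```python
import re


def matches(text, features, threshold=1):
    """Return True if the text contains at least `threshold` dictionary features."""
    lowered = text.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(feature) + r"\w*", lowered))
        for feature in features
    )
    return hits >= threshold


# Hypothetical toy dictionaries: one feature list per language, all for one concept.
MULTILINGUAL_DICTIONARY = {
    "en": ["migration", "refugee"],
    "de": ["migration", "flüchtling"],
    "es": ["migración", "refugiad"],
}


def approach_a(article_text, language):
    """Approach A: apply the feature list that matches the article's own language."""
    return matches(article_text, MULTILINGUAL_DICTIONARY[language])


def approach_b(article_text, translate_to_english):
    """Approach B: translate the article first, then apply the English feature list.

    `translate_to_english` is a placeholder for a machine translation step.
    """
    return matches(translate_to_english(article_text), MULTILINGUAL_DICTIONARY["en"])
```

Approach A needs a validated feature list for every language in the corpus, whereas Approach B needs only the English list plus a translation step, which is why it demands fewer language skills from the research team.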
Measuring migration frames
These findings subsequently helped the WP8 team to identify the best-performing dictionary and analysis approach for the multilingual migration text corpus. We constructed English-language dictionaries for the measurement of migration frames and applied them to machine-translated English media articles from seven EU member countries between 2000 and 2017. By ‘migration frames’, we mean migration-related subtopics that are based on varying problem definitions or causal interpretations. Examples are the ‘economy frame’ and the ‘security frame’. A security frame, for example, would include features such as ‘crime’, ‘criminal’, ‘border protection’ and ‘police’. By measuring these frames, it is possible to compare the number of articles that connect migration to a specific frame across countries and time.
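As a rough illustration of how such frame counts could be produced, the sketch below tallies articles per country, year, and frame. The security features are the ones named above. The economy features, the article records, the threshold of one hit, and all function names are assumptions made for this example and do not reproduce the project's dictionaries.

```python
import re
from collections import Counter

FRAME_DICTIONARIES = {
    # Features named in the text above.
    "security": ["crime", "criminal", "border protection", "police"],
    # Hypothetical features, for illustration only.
    "economy": ["labour market", "wage", "unemployment"],
}


def frames_in(text, threshold=1):
    """Return the set of frames whose features occur at least `threshold` times."""
    lowered = text.lower()
    found = set()
    for frame, features in FRAME_DICTIONARIES.items():
        hits = sum(
            len(re.findall(r"\b" + re.escape(feature) + r"\w*", lowered))
            for feature in features
        )
        if hits >= threshold:
            found.add(frame)
    return found


def count_frames(articles):
    """Count articles per (country, year, frame); texts are assumed to be in English already."""
    counts = Counter()
    for article in articles:
        for frame in frames_in(article["text"]):
            counts[(article["country"], article["year"], frame)] += 1
    return counts
```

Each article is counted at most once per frame, so the totals are numbers of articles rather than numbers of keyword hits, which matches the comparison described above. An article can still contribute to several frames at once.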
The findings of the research are discussed in a Work Package 8 report.
Fabienne Lind and Jakob-Moritz Eberl are researchers at the Department of Communication at the University of Vienna working on REMINDER Work Packages 8 and 9.