Modeling fine-grained sociolinguistic variation. The promises and pitfalls of Twitter corpora and neural word embeddings

Published on 25 November 2024. Updated on 25 November 2024.

Filip Miletić | CLLE, CNRS & Université Toulouse - Jean Jaurès | IMS, Universität Stuttgart
Anne Przewozny-Desriaux | CLLE, CNRS & Université Toulouse - Jean Jaurès
Ludovic Tanguy | CLLE, CNRS & Université Toulouse - Jean Jaurès

This chapter examines the use of recent data sources and computational methods to study fine-grained sociolinguistic phenomena. We deploy a custom-built corpus of tweets (Miletić et al. 2020) and neural word embeddings to investigate the use of contact-induced semantic shifts in Quebec English. Drawing on an analysis of 40 lexical items, we show that our approach is beneficial in facilitating manual inspection of vast amounts of data and establishing fine-grained patterns of language variation. While it is affected by a range of noise-related issues, which we describe in detail, coarse-grained annotation provides an efficient way of circumventing them. We use the results filtered in this way to conduct a quantitative analysis of sociolinguistic constraints on contact-induced semantic shifts, further confirming the relevance of our approach.
https://doi.org/10.1075/scl.118.09mil