Building, maintaining, and using a long-term empirical basis for German Linguistics

Published on 6 July 2016. Updated on 15 November 2016.
17 November 2016
2:00 pm
Room E412 - Maison de la Recherche

Marc Kupietz (Institute for the German Language, Mannheim) - CLLE-ERSS Seminar

Corpora have been built and used at the Institute for the German Language (IDS) in Mannheim since its foundation in 1964. The main motivation was probably less the emerging field of Artificial Intelligence or the Brown Corpus, which was built at roughly the same time, than the aim of making German linguistics less susceptible to ideological influences, which it had embraced during the Nazi period and which were observable in East Germany, by relying on strict empiricism and pushing the discipline in the direction of the hard sciences (cf. Teubert & Belica 2014: 298).

Beginning with the Mannheimer Korpus I (MK I, 2.2 million words) in 1969, a series of corpora has been compiled, used, and made available at the IDS. Since 2004 this collection has been called the (Mannheim) German Reference Corpus DeReKo (Kupietz et al. 2010); it is continuously expanded as a whole and now comprises more than 30 billion words, growing by 1 billion words per year. In the days of MK I, users had to punch their own Fortran programs to search and analyze the corpora. Beginning with REFER in 1983, followed by COSMAS I (1992), COSMAS II (2003), and KorAP (2016) (Bański et al. 2013), the IDS has also provided specialized corpus-linguistic software for DeReKo's now more than 38,000 users.

In the first part of my talk, I will sketch the history of DeReKo: its aims, the development of its German (corpus) linguistic background, its use inside and outside the IDS, its institutional embedding, its design principles, its expansion strategy, its strategy for coping with legal obligations, and recent developments. In the second part, I will introduce the new corpus analysis platform KorAP, which will smoothly replace COSMAS II during the coming years, and discuss the challenges it aims to cope with, ranging from computational to epistemological ones.



Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013): KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. pp. 586-587. Poznan: Fundacja Uniwersytetu im.

Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). pp. 1848-1854. ELRA, 2010.

Teubert, Wolfgang/Belica, Cyril (2014): Von der linguistischen Datenverarbeitung am IDS zur “Mannheimer Schule der Korpuslinguistik”. In: Institut für Deutsche Sprache (ed.): Ansichten und Einsichten. 50 Jahre Institut für Deutsche Sprache.