The computational linguistics group at Uppsala University carries out research and development in several areas related to SWE-CLARIN. In 2016, we are concentrating our contribution on one main project while continuing to work on some smaller projects.
The long-term goal of the main project is to provide web-based tools for automatic quantitative analysis of texts for linguists and other researchers in the humanities and social sciences. We have developed SWEGRAM, a web service that provides automatic linguistic annotation at the word and sentence level. It covers everything from tokenisation, sentence segmentation and normalisation of misspellings to annotation of words with their parts of speech, morphological information and syntactic structure in the form of dependency relations. The linguistic annotation can then be used to automatically derive statistics on different linguistic characteristics of the texts, for example the number of words and sentences in a text, the average number of characters per word, the distribution of word classes, or different measures of readability.
Fig. 1: SWEGRAM: A web-based tool for automatic linguistic annotation and analysis for Swedish texts.
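The kind of statistics described above can be illustrated with a minimal sketch. It does not show SWEGRAM's actual code or output format: the input structure (a list of sentences, each a list of word/tag pairs), the tag names and the function name text_statistics are all hypothetical. LIX is used only as one familiar example of a readability measure; which measures SWEGRAM reports is not stated here.

```python
from collections import Counter

# Hypothetical annotated input: one list per sentence, each token a (word, tag) pair.
# The tag names are illustrative and not necessarily the tagset SWEGRAM uses.
annotated = [
    [("Jag", "PRON"), ("läser", "VERB"), ("en", "DET"), ("bok", "NOUN"), (".", "PUNCT")],
    [("Boken", "NOUN"), ("är", "VERB"), ("intressant", "ADJ"), (".", "PUNCT")],
]


def text_statistics(sentences, punct_tag="PUNCT", long_word_len=6):
    """Derive simple text statistics from part-of-speech annotated sentences."""
    # Punctuation tokens are not counted as words.
    words = [(w, tag) for sent in sentences for (w, tag) in sent if tag != punct_tag]
    n_words = len(words)
    n_sents = len(sentences)
    if n_words == 0 or n_sents == 0:
        raise ValueError("need at least one word and one sentence")

    avg_word_len = sum(len(w) for w, _ in words) / n_words
    tag_counts = Counter(tag for _, tag in words)

    # LIX: average sentence length plus the percentage of long words
    # (words with more than `long_word_len` characters).
    long_words = sum(1 for w, _ in words if len(w) > long_word_len)
    lix = n_words / n_sents + 100 * long_words / n_words

    return {
        "words": n_words,
        "sentences": n_sents,
        "avg_word_length": avg_word_len,
        "tag_distribution": dict(tag_counts),
        "lix": lix,
    }


stats = text_statistics(annotated)
for name, value in stats.items():
    print(name, value)
```

Keeping the statistics as a function over already annotated sentences mirrors the two-step setup described above: annotation first, then counts and measures derived from the annotation.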
We have developed and tested our web-based tool for annotation and quantitative analysis on student essays from the national exams in Swedish and Swedish as a second language for different grades (3rd, 6th and 9th grade). We have also created a corpus of student essays of roughly 1.5 million words, which has been automatically annotated at the morpho-syntactic level (Megyesi et al., 2016).
The project is carried out in collaboration with Anne Palmér and other researchers of Swedish at the Department of Scandinavian Languages at Uppsala University. Currently, we focus on identifying linguistic features that are characteristic of student essays in different grades.
Fig. 2: Examples of potential misspellings.
Moreover, we are improving the normalisation tool, which finds potential misspellings and corrects them, so that it can be applied to different kinds of texts (for example, student essays or historical texts). We hope that our solutions are general and simple enough to be used by others who are interested in different kinds of corpus linguistic research.
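As a rough illustration of what finding and correcting potential misspellings can involve, the sketch below looks up each token in a word list and proposes the closest entries for unknown tokens. This is an assumed, simplified approach and not a description of how the normalisation tool works: the tiny lexicon, the function name suggest_corrections and the similarity cutoff are hypothetical, and the matching relies on the standard Python difflib module.

```python
import difflib

# Hypothetical miniature lexicon; a real system would use a large word list.
LEXICON = ["skolan", "mycket", "tycker", "eftersom", "också"]


def suggest_corrections(token, lexicon=LEXICON, cutoff=0.8):
    """Return candidate corrections for a token, or [] if it is a known word."""
    word = token.lower()
    if word in lexicon:
        return []  # known word: not flagged as a potential misspelling
    # Rank lexicon entries by string similarity and keep those above the cutoff.
    return difflib.get_close_matches(word, lexicon, n=3, cutoff=cutoff)


print(suggest_corrections("myket"))
print(suggest_corrections("mycket"))
```

A lookup of this kind depends entirely on the word list, so applying it to another kind of text, such as historical documents, would mainly mean swapping the lexicon and tuning the cutoff.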
Side projects
The side projects are mainly pursued with our own funding and in conjunction with existing projects within the group.
Information retrieval in historical texts: This project is carried out in collaboration with the Department of History at Uppsala University and aims to support information retrieval in historical documents on what men and women did for a living from 1500 to 1800. In 2016, we focus on evaluation of the automatic retrieval together with the historians.
Information extraction from historical handwriting: This project is a collaboration with the project group q2b (from quill to bytes), which includes researchers from image analysis, and aims at extracting both content information and metadata (e.g. author identification) from historical handwritten documents.
Swedish word processing (SWORD): This project also involves other groups within SWE-CLARIN and aims to document and standardise linguistic analysis of Swedish texts at the word level, including tokenisation, part-of-speech tagging, morphological analysis and lemmatisation. In 2016, we focus on tokenisation, which has led to a contribution to SLTC in Umeå.
Contrastive investigation of noun phrases in translated texts: This project involves collaboration with translation studies (Institut für Angewandte Sprachwissenschaft sowie Übersetzen und Dolmetschen at the Universität des Saarlandes in Saarbrücken) and aims at a contrastive investigation of noun phrases in English texts and their German translations, supported by language technology analysis. The results of the analysis will form the basis for both translation-theoretic conclusions and improvements of machine translation systems.
Beata Megyesi, Jesper Näsman, and Anne Palmér. 2016. The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA).