Universiteit Leiden


Leiden University Centre for Digital Humanities

Barbiers Project

Detecting cross-linguistic Syntactic Differences Automatically (DeSDA)

Sjef Barbiers

Outline

The goal of this project is to investigate whether syntactic differences between languages can be detected automatically, using online parallel corpora and software tools for annotation, search and analysis. This approach has the potential to greatly enhance the empirical basis of theoretical comparative syntax research and will enable syntacticians to model syntactic variation theoretically on the basis of quantitative analysis of the correlations between syntactic properties. Since the project should be seen as a test case for this larger goal and has to be carried out by one PhD student, the research topic will be narrowed down to the (morpho-)syntax of verbs in Germanic languages, including auxiliaries. The main descriptive question will be: Which differences do we find in the Germanic languages with respect to the structural positions of verbs and with respect to verbal inflection? The main theoretical question will be: To what extent is the existing theory of verb placement and inflection, as it has been developed since the late 1980s, capable of capturing these facts?

Background

Comparative syntax is the branch of linguistics that describes cross-linguistic syntactic similarities and differences and tries to capture them in a formal theory that explains the range, limits and loci (i.e. place within the mental grammar) of syntactic variation in natural language [1]. It seeks to answer the question of which syntactic properties are universal, which are language specific and how these properties interact. Traditionally, syntacticians collect data by comparing their native languages with other languages, by consulting reference grammars and linguistic colleagues (e.g., Taalportaal for Dutch and Frisian [2]; the typological database WALS [3]), by carrying out surveys and fieldwork (cf. [4]) and by expert sourcing (e.g., SSWL [5]).
These methods have in common that the researcher needs to have expectations, in the best case a theory, of the kind of syntactic differences to be found. Consequently, many differences will not be detected, descriptions will be incomplete and correlations between syntactic differences will often not be discovered.

Moreover, the number of (morpho-)syntactic differences between two languages or language varieties is potentially very high, even if these varieties are closely related. For example, the Syntactic Atlas of the Dutch Dialects ([6], [7]) describes more than 100 syntactic differences between closely related and superficially very similar Dutch dialects. The number of language varieties to be included in the comparison is also very large: a conservative estimate would be 6000 languages times the number of dialects that each of these languages has. Traditional methods of collecting comparative data are therefore slow and incomplete, and there is a need for corpora and tools that make automatic, systematic and rigorous qualitative and quantitative syntactic comparison possible.

Corpora and tools

The OPUS corpus [8] seems to be the answer to this need. It contains large parallel corpora such as Europarl (parallel texts from the European Union) and OpenSubtitles 2016, among many others. According to Tiedemann [9], in 2012 the OPUS corpus covered over 90 languages and 3,800 language pairs, with sentence-aligned data comprising a total of over 40 billion tokens in 2.7 billion parallel units (= aligned sentences and sentence fragments). Various interfaces are available to search these corpora. Figure 1 shows the output of a multilingual corpus query in OPUS, a set of aligned sentences.

 

Figure 1: Result of a multilingual OPUS corpus query

The human observer can immediately derive syntactic differences from this result. For example, we see in the second fragment that wilt geven in Dutch corresponds to vil give in Danish, versehen wollen in German and want to give in English, with differences in the inflection of the verbs, the presence of (an equivalent of) to, the position of the two verbs within the clause (e.g., clause final in Dutch but clause initial in Danish) and the relative order of the two verbs (e.g., wilt geven in Dutch but versehen wollen in German).

Tasks and questions

To make automatic extraction of syntactic differences possible, alignment is also required at the level of words (ideally, morphemes), POS tags and phrases (subtrees). This in turn requires that POS taggers and parsers are available for each of the languages in the comparison. The OPUS website provides links to language-specific POS taggers and parsers for Czech, Chinese, Danish, Dutch, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovene, Spanish, Swedish and Turkish [10]. Syntactic comparison will be possible if the same tagset can be used for each of the languages involved (cf. [11]). A probably more feasible alternative is to define mappings between tagsets, using relations such as ‘identical’, ‘near identical’, ‘subsumes’, etc.

A first task of the PhD student is to make uniform or mapped tagging and parsing possible for the Germanic languages involved, in the ideal case also at the morphosyntactic level. A second task is to deal with the errors resulting from automatic tagging and parsing (cf. [12]). A third task is to develop automatic alignment at the level of morphemes, words, tags and subtrees.

A fourth task is to develop a tool that selects only those fragments for which syntactic comparison is possible. There will be many pairs of sentences for which syntactic comparison is impossible, not because the syntax of the two languages is too different but because the syntax of the two sentences is too different. For example, in the search result in Figure 2 the English phrase On the subject at hand corresponds to German Zum Thema and Dutch Dan nu het eigenlijke onderwerp. If the translation Over dit onderwerp had been chosen for Dutch, the three phrases could have been compared, but the actual translation is “too free” syntactically. The tool that filters out such cases will have to work with a threshold for syntactic correspondence, using Levenshtein distance and other techniques to measure syntactic similarity [13].
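
The filtering step of the fourth task could look roughly as follows. This is a minimal sketch under several assumptions, not the tool the project will deliver: the tag names, the mappings in TAG_MAPS, the hand-assigned tag sequences and the threshold of 0.5 are all invented for illustration, and only the ‘identical’ mapping relation is modelled. It maps language-specific tags to a shared tagset, computes the Levenshtein distance between the resulting tag sequences and keeps a pair only if the normalized similarity reaches the threshold.

```python
from typing import Dict, List, Sequence

# Hypothetical language-specific tags mapped onto a shared coarse tagset.
# Tag names and mappings are invented for illustration only.
TAG_MAPS: Dict[str, Dict[str, str]] = {
    "en": {"PREP": "ADP", "ART": "DET", "N": "NOUN"},
    "nl": {"VZ": "ADP", "LID": "DET", "DEM": "DET", "N": "NOUN",
           "BW": "ADV", "ADJ": "ADJ"},
}


def to_shared_tags(lang: str, tags: Sequence[str]) -> List[str]:
    """Map language-specific tags to the shared tagset; unknown tags become 'X'."""
    mapping = TAG_MAPS[lang]
    return [mapping.get(tag, "X") for tag in tags]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two tag sequences (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tag_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def comparable(a: Sequence[str], b: Sequence[str], threshold: float = 0.5) -> bool:
    """Keep a fragment pair only if its tag sequences are similar enough."""
    return tag_similarity(a, b) >= threshold


# Hand-assigned, illustrative tag sequences for the phrases discussed above.
english = to_shared_tags("en", ["PREP", "ART", "N", "PREP", "N"])   # On the subject at hand
dutch_free = to_shared_tags("nl", ["BW", "BW", "LID", "ADJ", "N"])  # Dan nu het eigenlijke onderwerp
dutch_close = to_shared_tags("nl", ["VZ", "DEM", "N"])              # Over dit onderwerp

print(comparable(english, dutch_close))  # True  (distance 2, similarity 0.6)
print(comparable(english, dutch_free))   # False (distance 4, similarity 0.2)
```

With these toy inputs the closer translation passes the filter and the free translation is rejected. A realistic threshold would have to be calibrated on annotated data, and a measure over subtrees rather than flat tag sequences may turn out to be more appropriate.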

 

Figure 2: Search result CQP mode Opus – Europarl

When alignment is provided at these levels, automatic extraction of syntactic differences becomes possible. A good example of an extraction method is described in Wiersma et al. [14]. They compare the varieties of English spoken by two different cohorts of Finnish immigrants in Australia with a method that consists of the following steps: POS-tag the text corpora to be compared, extract n-grams (1- to 5-grams) of POS tags from them, compare their relative frequencies using a permutation test, sort the significant POS n-grams by extent of difference, and analyse the results. This method provides results at an aggregate level and identifies the n-grams of POS tags that are primarily responsible for the syntactic differences between the two language varieties. Tools for this method are available [15].

It needs to be investigated whether the method of Wiersma et al. can also be applied to the comparison of language varieties that are less closely related than the English varieties of Finnish immigrants. This will be the fifth task. If feasible, the method can be extended to the comparison of parse trees (cf. [16]). The differences thus detected at the various levels of alignment will be stored in a database.

The sixth task involves quantitative analysis. Once a list of syntactic differences between the languages under comparison has been derived, associations between syntactic variables can be detected with data mining techniques (cf. [13]). The seventh and final task is then to evaluate these associations in the light of current theories of verbal syntax.
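
The POS n-gram comparison summarized above can be illustrated with a small sketch. It is a simplified reading of the listed steps and not the procedure or the tools of Wiersma et al. [14], [15]: the function names, the per-length normalization of frequencies, the sentence-level permutation scheme and the toy corpora are assumptions made for this example, and no correction for multiple comparisons is applied.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_counts(sentences: Sequence[Sequence[str]], max_n: int) -> Counter:
    """Count all POS n-grams of length 1..max_n within sentence boundaries."""
    counts: Counter = Counter()
    for tags in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tags) - n + 1):
                counts[tuple(tags[i:i + n])] += 1
    return counts


def relative_freqs(counts: Counter) -> Dict[NGram, float]:
    """Frequency of each n-gram relative to all n-grams of the same length."""
    totals: Counter = Counter()
    for gram, count in counts.items():
        totals[len(gram)] += count
    return {gram: count / totals[len(gram)] for gram, count in counts.items()}


def freq_differences(a, b, max_n: int) -> Dict[NGram, float]:
    """Relative frequency in corpus a minus relative frequency in corpus b."""
    fa = relative_freqs(ngram_counts(a, max_n))
    fb = relative_freqs(ngram_counts(b, max_n))
    return {g: fa.get(g, 0.0) - fb.get(g, 0.0) for g in set(fa) | set(fb)}


def permutation_test(a, b, max_n: int = 5, rounds: int = 1000,
                     seed: int = 0) -> Dict[NGram, Tuple[float, float]]:
    """Return {n-gram: (observed difference, permutation p-value)}."""
    rng = random.Random(seed)
    observed = freq_differences(a, b, max_n)
    exceed: Counter = Counter()
    pooled: List[Sequence[str]] = list(a) + list(b)
    for _ in range(rounds):
        # Reassign sentences to the two groups at random, keeping group sizes.
        rng.shuffle(pooled)
        permuted = freq_differences(pooled[:len(a)], pooled[len(a):], max_n)
        for gram, diff in observed.items():
            if abs(permuted.get(gram, 0.0)) >= abs(diff):
                exceed[gram] += 1
    return {g: (d, (exceed[g] + 1) / (rounds + 1)) for g, d in observed.items()}


def significant_ngrams(results, alpha: float = 0.05):
    """Significant n-grams, sorted by the size of the frequency difference."""
    hits = [(g, d, p) for g, (d, p) in results.items() if p < alpha]
    return sorted(hits, key=lambda hit: abs(hit[1]), reverse=True)


# Toy corpora (invented): the two varieties differ only in verb order.
variety_a = [["PRON", "AUX", "VERB"]] * 20
variety_b = [["PRON", "VERB", "AUX"]] * 20

results = permutation_test(variety_a, variety_b, max_n=3, rounds=200)
for gram, diff, p_value in significant_ngrams(results):
    print(gram, round(diff, 2), round(p_value, 3))
```

In this toy setting the unigram frequencies are identical across the two corpora, so only the bigrams and trigrams that encode the order of auxiliary and main verb come out as significant. This is the kind of aggregate signal the method is meant to surface.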
Although there are many open issues, there has been a rough consensus since Pollock’s seminal paper [17] that cross-linguistically at least three structural positions are available for verbs, depending on their morphology: a clause-final V-position corresponding to the base position of the verb (both finite and non-finite), a clause-medial T-position where tensed and inflected verbs occur, and a clause-initial C-position to which finite verbs move to mark the clause type. In addition, verb placement and inflection have been relatively well described and analyzed for the Germanic languages (cf. [18], [19]), so that both the descriptive and the theoretical results can be evaluated.
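
For the sixth task, one simple way to look for associations between syntactic variables is to search for implications of the form "every variety that has property A also has property B". The sketch below is an assumption about how such a search could be set up, not the technique of [13]: the function name, the support and confidence thresholds, and the varieties and properties (V1-V4, P1-P3) are placeholders.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

# Hypothetical input: for each language variety, the set of syntactic
# properties found for it. Names are placeholders, not real findings.
observations: Dict[str, Set[str]] = {
    "V1": {"P1", "P2"},
    "V2": {"P1", "P2", "P3"},
    "V3": {"P2"},
    "V4": {"P3"},
}


def implication_rules(data: Dict[str, Set[str]], min_support: int = 2,
                      min_confidence: float = 1.0
                      ) -> List[Tuple[str, str, int, float]]:
    """Find rules A -> B as (A, B, support of A, confidence of A -> B)."""
    properties = sorted(set().union(*data.values()))
    rules = []
    for a, b in permutations(properties, 2):
        with_a = [props for props in data.values() if a in props]
        if len(with_a) < min_support:
            continue  # too few varieties with property A to say anything
        both = sum(1 for props in with_a if b in props)
        confidence = both / len(with_a)
        if confidence >= min_confidence:
            rules.append((a, b, len(with_a), confidence))
    return rules


print(implication_rules(observations))  # [('P1', 'P2', 2, 1.0)]
```

Any association found in this way is only a candidate generalization. Whether it follows from the theory of verb placement and inflection is what the seventh task has to establish.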

Research program and supervision scheme¹

(Overall supervision: Sjef Barbiers (Leiden) and Jan Odijk (UU, director of CLARIAH).)

Year 1
Tasks:
• Adaptation/uniformization/mapping POS-taggers and parsers and error correction
• Developing a tool for alignment of parallel texts at the level of morphemes, words, POS-tags and subtrees
Deliverables: Tools, paper, talk
Supervision: Barbiers/Odijk

Year 2
Tasks:
• Developing a tool for the selection of syntactically comparable fragments
• Testing automatic extraction of syntactic differences using the method of Wiersma et al.
Deliverables: Tool, database of syntactic differences, paper, talk
Supervision: Barbiers/Odijk

Year 3
Tasks:
• Optimizing the tools
• Data mining to discover syntactic associations
• Integration of data and tools in CLARIAH
Deliverables: Paper, talk
Supervision: Barbiers/Odijk

Year 4
Tasks:
• Evaluation of the results, i.e. the syntactic differences and associations found, against existing descriptions and analyses.
• Writing the thesis
Deliverables: Dissertation
Supervision: Barbiers/Odijk

 

References

[1] Cinque, G. and R. Kayne (eds.). 2008. The Oxford Handbook of Comparative Syntax. New York: Oxford University Press.

[2] www.taalportaal.org

[3] www.wals.info

[4] www.dialectsyntax.org

[5] sswl.railsplayground.net/

[6]/[7] Barbiers, S., et al. 2005/8. Syntactic Atlas of the Dutch Dialects. Volumes I and II. Amsterdam: AUP.

[8] opus.lingfil.uu.se/

[9] Tiedemann, J. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).

[10] Cf. opus.lingfil.uu.se/trac/wiki/Tagging%20and%20Parsing

[11] Cf. http://universaldependencies.org/

[12] Bloem, J. 2016. Evaluating automatically annotated treebanks for linguistic research. In Piotr Banski et al. (eds.). Proceedings of the 4th Workshop on Challenges in the Management of Large Corpora (CMLC 4), 8-14. Paris: ELRA.

[13] Spruit, M. R. 2008. Quantitative perspectives on syntactic variation in Dutch. Diss. U. of Amsterdam. LOT Dissertations 174.

[14] Wiersma, W., Nerbonne, J., and T. Lauttamus, 2011. Automatically Extracting Typical Syntactic Differences from Corpora. Literary and Linguistic Computing 26(1).

[15] en.logilogi.org/Wybo_Wiersma/User/Com_Lin_Too.

[16] Sanders, N. C. 2007. Measuring Syntactic Differences in British English. In Proc. of the Student Research Workshop, 1-7. Omnipress, Madison.

[17] Pollock, J-Y. 1989. Verb Movement, Universal Grammar, and the Structure of IP. Linguistic Inquiry 20, 365-424.

[18] Platzack, C. and A. Holmberg. 2008. The Scandinavian Languages. In G. Cinque and R. Kayne (eds.). The Oxford Handbook of Comparative Syntax. New York: Oxford University Press.

[19] Zwart, J.W. 2008. Continental West-Germanic Languages. In G. Cinque and R. Kayne (eds.). The Oxford Handbook of Comparative Syntax. New York: Oxford University Press.

¹ The number of languages and syntactic phenomena in the project can be extended or reduced if necessary.