
Looking for the right corpus data

asked 2015-07-27 05:27:52 -0500

Eileen

For my PhD research I am building a computational model that predicts how two languages converge in the bilingual mind. With this model, I want to establish which elements of language are more likely to converge and are therefore more vulnerable to contact-induced change.

Put simply, the model takes parental input that is coded for morphosyntax and semantics and, based on the structural and conceptual overlap, produces output that should correspond to the child’s actual linguistic output. To do this, I need a corpus that contains a reasonable amount of parental input in two languages (where one parent speaks one language and the other parent another language) and, of course, the child’s output. Furthermore, because contact-induced change is probably more visible in languages that do not share a large part of their structures, I would like to look at bilingual data from languages of different language families.

Are you aware of a transcribed corpus that contains these data? That is, a corpus that contains a reasonable amount of bilingual input under the one parent, one language policy (OPOL), a combination of languages from two different language families, and the child’s output (the age range does not matter).

Thanks a lot!


1 answer


answered 2017-03-11 15:59:26 -0500

Hi, Eileen,

Rather than 'converging', it appears that bilinguals, from the very earliest stages, keep their two (or more) languages pretty well compartmentalized. (They also know which language to use with whom, and whether and how to use code-switching.) It sounds like you are operating from different suppositions, something like 'language mixing', which I think is now accepted as being very unusual rather than the norm. Even in what we call code-switching, the changes from one code (language) to the other are almost always done in 'chunks', that is, at major syntactic boundaries. (There are some exceptions to this, such as pidgins and a few other cases, usually where there is insufficient exposure to one of the languages.)

Probably the largest corpus for (mainly child) language acquisition is CHILDES, which is multilingual and transcribed in ASCII (text). The site also hosts programs to analyze and manipulate the data, much of which is of bilingual children, though I don't know how much of it covers both languages. If this corpus is unsuitable for you (IMO unlikely), try Googling to find what might be more suitable. There are also corpora of language learners (e.g., ESL).
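As a rough illustration of checking whether a transcript contains both parents' input and the child's output, here is a minimal Python sketch. It assumes plain-text transcripts in a CHAT-like layout, where main speaker tiers start with `*` plus a speaker code, headers start with `@`, and dependent tiers start with `%`. The sample text, the speaker codes shown, and the function name `utterances_by_speaker` are made up for illustration; this is not one of the programs offered on the CHILDES site.

```python
# Tiny invented transcript in a CHAT-like layout (not real corpus data).
SAMPLE = """@Begin
@Participants:\tCHI Target_Child, MOT Mother, FAT Father
*MOT:\twhere is the ball ?
%mor:\tadv|where cop|be det|the n|ball ?
*CHI:\tball there .
*FAT:\tdoko ni aru no ?
*CHI:\tthere .
@End
"""


def utterances_by_speaker(chat_text):
    """Group main-tier utterances by speaker code (hypothetical helper)."""
    by_speaker = {}
    for line in chat_text.splitlines():
        # Main tiers start with '*'; '@' headers and '%' dependent tiers are skipped.
        if not line.startswith("*"):
            continue
        speaker, sep, utterance = line[1:].partition(":")
        if not sep:
            continue  # malformed line without a speaker separator
        by_speaker.setdefault(speaker.strip(), []).append(utterance.strip())
    return by_speaker


grouped = utterances_by_speaker(SAMPLE)
counts = {speaker: len(utts) for speaker, utts in grouped.items()}
print(counts)
```

A transcript would be a candidate for OPOL-style analysis only if two parental speaker codes and the child all show up with a usable number of utterances. Note that this sketch ignores utterances that continue onto a following line, which real transcripts may contain.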


James L. Fidelholtz, Posgrado en Ciencias del Lenguaje, Instituto de Ciencias Sociales y Humanidades, Benemérita Universidad de Puebla, Puebla, Puebla, MÉXICO


Seen: 502 times

Last updated: Jul 27 '15