What are some English corpora which have updated taboo language?

asked 2015-05-25 03:31:24 -0400

shellyshellz

updated 2015-05-25 03:38:13 -0400

I realise that there is COLT, which is part of the BNC. The thing is, these two corpora are getting a bit old, and I'd like some authentic data with more up-to-date taboo language, new slang, etc.

I'm trying to draft a PhD proposal and am currently stuck on finding data. :(

2 answers

answered 2015-05-25 05:55:03 -0400

updated 2015-05-25 05:56:02 -0400


Corpora with "updated" taboo language (though it depends on what you mean by "updated", I guess) are not easy to find, since naturally occurring instances of swearing and taboo language are hard to record to begin with. It also depends on what you are looking for. Do you want to analyze taboo language in spoken or written contexts? Do you want to focus on a specific region? I think these kinds of questions need to be answered before looking for corpora.

I was in the exact same situation two years ago and found no available corpus that met all my requirements (though they were relatively specific), so I eventually decided to build my own corpus from Twitter. However, some corpora that may still be of interest in this regard are:

  • The Scottish Corpus of Texts and Speech
  • The GloWbE corpus, though in my opinion only a very small fraction of it will be of interest for swearing
  • The Limerick Corpus of Irish English, though I don't know the extent to which swear words are present in this one

I hope this helps, and if anyone has updated information on these kinds of corpora, I would be very interested in learning about it too!

If you want to discuss this further, feel free to contact me via my academia profile. I'd be glad to hear more about the details of your research or to exchange ideas!

Good luck!


answered 2015-05-25 06:44:26 -0400

Costas

You can also try two corpora of teenage London English, Linguistic Innovators and Multicultural London English, both available via Sketch Engine.

For corpus-based studies using the Linguistic Innovators data, see:

Torgersen, E., Gabrielatos, C., Hoffmann, S. & Fox, S. (2011). A corpus-based study of pragmatic markers in London English. Corpus Linguistics and Linguistic Theory, 7(1), 93-118.

Gabrielatos, C., Torgersen, E., Hoffmann, S. & Fox, S. (2010). A corpus-based sociolinguistic study of indefinite article forms in London English. Journal of English Linguistics, 38(4), 297-334.



Thanks Costas!

shellyshellz (2015-05-29 00:26:22 -0400)
