The LINGUIST List is dedicated to providing information on language and language analysis, and to providing the discipline of linguistics with the infrastructure necessary to function in the digital world. LINGUIST is a free resource, run by linguistics students and faculty.
Straightforward software for forming a corpus of on-line texts

asked 2015-04-28 11:13:23 -0400

Hello, please can you advise me.

We are just starting a project, run from inside Uganda, to search Ugandan online newspapers for themes and language use relating to climate change. Our facilities are limited, even for purchasing software.

So if anyone can suggest the best means of forming a straightforward corpus for fairly simple searches (words, strings, etc.), I would be grateful.

Annette Islei

2 answers

answered 2015-04-28 23:10:02 -0400

usagi5886

You might also find some useful tools for corpus construction by browsing the software listed on the Digital Research Tools (DiRT) directory, most of which are free. See, for example, the page on transcription.

answered 2015-04-28 12:44:58 -0400

Hi Annette,

if you want to build your own corpus from web content, first make sure that this is legal in your country and that the owners of the content allow it. Once you are certain that the web content you want can legally be copied to your computer and used for research or education, you could use wget, or a tool based on it, to fetch the content via its URL. wget is quite powerful and can fetch an entire site with many sub-pages. You should be able to find many tutorials online on how to use wget, either on the command line or through a graphical front end (for Mac or Windows). Use your favorite search engine to find out where to get wget for your operating system (it is often already installed on Linux, while on a Mac it usually needs to be installed separately).
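To make the wget step a little more concrete, here is a minimal sketch in Python that only assembles a wget command for a polite, depth-limited download. The site address, the folder name raw_html and the function name build_wget_command are made up for illustration; the options themselves (--recursive, --level, --no-parent, --wait, --adjust-extension, --directory-prefix) are standard wget options.

```python
import subprocess


def build_wget_command(start_url, out_dir, depth=2, wait_seconds=2):
    """Return a wget argument list for a polite, depth-limited download."""
    return [
        "wget",
        "--recursive",                    # follow links to sub-pages
        f"--level={depth}",               # but only this many links deep
        "--no-parent",                    # never climb above the start URL
        f"--wait={wait_seconds}",         # pause between requests
        "--adjust-extension",             # save HTML pages with an .html suffix
        f"--directory-prefix={out_dir}",  # folder where the files are stored
        start_url,
    ]


def mirror_site(start_url, out_dir, dry_run=True):
    """Print the command (dry run) or hand it to wget, which must be installed."""
    command = build_wget_command(start_url, out_dir)
    if dry_run:
        print(" ".join(command))
    else:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical address; replace it with a site you are allowed to copy.
    mirror_site("https://news.example.org/climate/", "raw_html")
```

With dry_run=True nothing is downloaded, so you can check the command first. The wait between requests matters on slow connections and keeps the load on the newspaper's server low.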

One way to turn HTML pages (ordinary web pages) into a corpus format is to convert them to an XML vocabulary, for example the XML of the Text Encoding Initiative (TEI). You can use OxGarage for that: it can convert HTML, OpenOffice/LibreOffice or Word documents to TEI XML. This would allow you to handle meta-information and additional markup, e.g. linguistic annotations, either as inline or as standoff annotations. There used to be a plugin for OpenOffice/LibreOffice that lets you open HTML, Word, ODF or other texts in OpenOffice and export them directly to TEI XML.
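To give an idea of what such a TEI file looks like, here is a minimal sketch that wraps already-extracted paragraphs in a bare TEI skeleton using only the Python standard library. It is not what OxGarage produces; the function name text_to_tei and the sample metadata are invented for illustration, and a real project would record much more in the header (newspaper, date, URL).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the TEI P5 namespace


def text_to_tei(title, source_note, paragraphs):
    """Wrap plain paragraphs in a minimal TEI document and return it as a string."""
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

    def el(parent, tag, text=None):
        child = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        child.text = text
        return child

    root = ET.Element(f"{{{TEI_NS}}}TEI")
    # Minimal header: title, publication statement and source description.
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished research corpus file.")
    el(el(file_desc, "sourceDesc"), "p", source_note)
    # The article text itself, one <p> element per paragraph.
    body = el(el(root, "text"), "body")
    for paragraph in paragraphs:
        el(body, "p", paragraph)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample content, for illustration only.
    print(text_to_tei("Sample article", "Hypothetical online newspaper, 2015",
                      ["First paragraph.", "Second paragraph."]))
```

Keeping the source information in the header means each file still says where it came from when you search or annotate it later.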

You can also use simple HTML-to-text converters to create raw text files in a Unicode encoding, e.g. UTF-8. Simple tools like unhtml, available online for various operating systems, remove all HTML tags from the files and leave just the raw text content for further analysis. Alternatively, HTML files can be opened with common word-processing software (OpenOffice/LibreOffice, MS Word, etc.) and exported as raw text.
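If you prefer to script the tag stripping yourself, the following is a rough sketch using Python's built-in html.parser module. It is not how unhtml works internally; the class name TextExtractor, the function name html_to_text and the list of block-level tags are my own choices for illustration. It drops script and style content and starts a new line after block-level elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements end the current line

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

To save the result as UTF-8, you could write it out with open(path, "w", encoding="utf-8"). Newspaper pages usually also contain menus and adverts, so expect to do some manual cleaning afterwards.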

For simple tools to extract collocations, frequency profiles and concordances, Laurence Anthony offers a collection of software, e.g. AntConc and other tools. You will find further tools online if you search for concordance and corpus analysis tools.
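For very simple searches you can also get quite far with a few lines of Python. The sketch below is only an illustration of what a frequency profile and a keyword-in-context (KWIC) concordance are, not a replacement for AntConc; the function names and the sample sentence are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens (letters, ' and - inside words)."""
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())


def frequency_profile(text, top=10):
    """Return the `top` most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(top)


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines with `width` characters on each side."""
    text = " ".join(text.split())  # collapse line breaks so each hit stays on one line
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


if __name__ == "__main__":
    # Made-up sample text, for illustration only.
    sample = "Drought hit the north. The drought ended when rain came. Rain is welcome."
    print(frequency_profile(sample, top=3))
    for line in concordance(sample, "drought"):
        print(line)
```

The same functions can be run over every text file in a folder by reading each file and joining the contents.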

If you want to process the corpus linguistically, that is, tokenize the text, break it into sentences, generate lemmata or part-of-speech tags for the tokens, or even parse the texts syntactically, you could look at the Stanford CoreNLP tools, the GATE text mining tools, etc.
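To show what the first two of these steps mean, here is a deliberately naive sketch of sentence splitting and tokenization with regular expressions. It is not how CoreNLP or GATE work, and it will fail on abbreviations such as "Dr." or "U.S."; handling such cases, as well as lemmata and part-of-speech tags, is exactly where the dedicated tools earn their keep. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def tokenize_sentence(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


if __name__ == "__main__":
    # Made-up sample text, for illustration only.
    sample = "Rain fell in Gulu. Farmers were relieved! Was it enough?"
    for sentence in split_sentences(sample):
        print(tokenize_sentence(sentence))
```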

Best wishes


