Hi Annette,

If you want to build your own corpus from web content, you first have to make sure that this is legal in your country and that the owners of the content allow it. Once you are certain that the web content you want to use can legally be copied to your computer and used for research or education, you could use wget, or a tool based on it, to fetch the content via its URL. wget is pretty powerful and can fetch an entire site with all of its sub-pages. You should be able to find plenty of tutorials online on how to use wget, either from the command line or through a graphical frontend (for Mac or Windows). Use your favorite search engine to find out where to get wget for your operating system (it may already be installed if you are using Linux; on a Mac you will usually have to install it yourself).
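As a rough sketch of what such a wget call can look like, the following Python snippet only assembles the command line and runs it on request. The function names, the default depth, the output directory corpus_raw and the URL https://example.org/ are placeholders I made up for illustration; the wget options themselves are standard GNU wget options.

```python
import subprocess


def build_wget_command(url, depth=2, out_dir="corpus_raw"):
    """Assemble a wget command line that fetches a page and its sub-pages."""
    return [
        "wget",
        "--recursive",                    # follow links to sub-pages
        f"--level={depth}",               # limit how deep the recursion goes
        "--no-parent",                    # never climb above the start URL
        "--wait=1",                       # pause one second between requests
        f"--directory-prefix={out_dir}",  # directory for the downloaded files
        url,
    ]


def fetch_site(url, depth=2, out_dir="corpus_raw", dry_run=True):
    """Return the command; only call wget when dry_run is False."""
    cmd = build_wget_command(url, depth, out_dir)
    if not dry_run:
        # Requires wget to be installed and on the PATH.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder URL: replace it with a site you are allowed to copy.
    print(" ".join(fetch_site("https://example.org/")))
```

With dry_run left at True nothing is downloaded, so you can inspect the command first and paste it into a terminal if you prefer.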

One way to convert HTML pages (ordinary web pages) to a corpus format is to transfer the page encoding to an XML language, for example the XML of the Text Encoding Initiative (TEI). You can use OxGarage for that: it can convert HTML, OpenOffice/LibreOffice or Word documents to TEI XML. This would allow you to handle meta-information and additional markup, e.g. linguistic annotations, either as inline or as standoff annotations. There also used to be a plugin for OpenOffice/LibreOffice that allowed you to open HTML, Word, ODF or other texts and export them directly to TEI XML.
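To give an idea of what the target format looks like, here is a minimal sketch that wraps a title and a list of paragraphs in a bare-bones TEI document using only the Python standard library. It is not what the converters mentioned above produce, and the function name, the publication statement and the source note are made-up placeholders. The output is not validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def paragraphs_to_tei(title, paragraphs, source_note="Downloaded web page"):
    """Wrap plain paragraphs in a minimal TEI skeleton and return XML text."""
    # Serialize TEI as the default namespace (no prefix on element names).
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")

    # Header: the three parts of fileDesc hold the basic meta-information.
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished research corpus")
    el(el(file_desc, "sourceDesc"), "p", source_note)

    # Body: one <p> element per paragraph of the source text.
    body = el(el(root, "text"), "body")
    for para in paragraphs:
        el(body, "p", para)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(paragraphs_to_tei("Sample page", ["First paragraph.", "Second one."]))
```

Inline annotations would then go inside the p elements, while standoff annotations would live in separate elements or files that point back to them.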

You can also use simple HTML-to-text converters to create raw text files in a Unicode encoding, e.g. UTF-8. Simple tools like unhtml, available online for various operating systems, remove all HTML tags from the files and leave just the raw text content for further analysis. Alternatively, HTML files can be opened with common word-processing software (OpenOffice/LibreOffice, MS Word, etc.) and exported as raw text.
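If you would rather script the tag removal yourself, the following is a minimal sketch using Python's built-in html.parser module. It is not how unhtml works internally; the class and function names are my own, and it simply drops tags, skips script and style blocks, and collapses whitespace.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, ignoring scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp;, &eacute;, ...
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words from neighbouring elements apart, but
    # it also splits words that contain inline tags, e.g. <b>H</b>ello.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Caf&eacute; &amp; bar</p></body></html>"
    print(html_to_text(page))
```

To store the result as UTF-8, write it out with an explicit encoding, for example open("page.txt", "w", encoding="utf-8").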

For simple tools to extract collocations, frequency profiles and concordances, Laurence Anthony offers a software collection on his website, e.g. AntConc and other tools. You will find further tools online if you search for concordance and corpus analysis tools.
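To make those three terms concrete, here is a small sketch of a frequency profile, a keyword-in-context concordance and raw bigram counts in plain Python. It is a toy illustration and not how AntConc is implemented; the function names are my own, and raw bigram counts are only a crude stand-in for collocations, which dedicated tools rank with association measures such as mutual information.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequency_profile(tokens, top=10):
    """Return the most frequent tokens with their counts."""
    return Counter(tokens).most_common(top)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def bigram_counts(tokens):
    """Count adjacent token pairs as a rough proxy for collocations."""
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    tokens = tokenize("The cat sat on the mat. The cat slept.")
    print(frequency_profile(tokens, top=3))
    print(concordance(tokens, "cat", width=2))
    print(bigram_counts(tokens).most_common(2))
```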

If you want to process the corpus linguistically, that is, tokenize the text, break it into sentences, generate lemmata or part-of-speech tags for the tokens, or even parse the texts syntactically, you could look at the Stanford CoreNLP tools, the GATE text mining tools, etc.
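Just to show what the first two of these steps produce, here is a deliberately naive sketch of sentence splitting and tokenization with regular expressions. It does not use CoreNLP or GATE, the function names are my own, and it breaks on abbreviations such as "Dr.", which is one reason the dedicated toolkits are worth the setup effort.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def split_tokens(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "It rained. We stayed home! Why not?"
    for sentence in split_sentences(text):
        print(split_tokens(sentence))
    # Known weakness: the abbreviation is treated as a sentence end.
    print(split_sentences("Dr. Smith arrived."))
```

Lemmatization, part-of-speech tagging and parsing need trained models or lexicons, so for those steps a simple script like this is not a substitute.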

Best wishes