2018-01-17 21:01:19 -0400 answered a question ELAN users: can you automatically segment a video into time chunks?

ELAN2split allows the segmentation of audio and transcript (see you could add an

2015-04-29 01:09:07 -0400 answered a question Are determiners a separate part of speech?

The arguments for treating determiners as an independent part of speech are mainly syntactic. I recommend taking a look at Steven Abney's PhD thesis:

and a lot more work on DPs since then...

One could argue that determiners are a separate word class based on their syntactic properties, in English but also in other languages: they form a class of mutually exclusive elements that occupy a specific syntactic position relative to the head noun. In English, typical D elements would be articles, demonstratives, possessive pronouns, pronouns, some quantifiers, etc. The arguments for treating all those elements as determiners are structural: mutual exclusiveness, selectional properties (e.g. with respect to the head noun), etc.

Articles and pronouns are mutually exclusive:

the house

* the he (he here not in the Stephen King sense, or to flip it around in the The The sense)

Articles and possessives are mutually exclusive:

the house

his house

* the his house

Some nouns are not compatible with certain articles:

he bought the furniture

* he bought a furniture

Determiners precede nouns and pre-nominal adjectives:

the green tree

* green the tree


Thus the assumption that Ds are heads of DPs that take NP complements:

[DP D [NP N ] ]

This might be the only part of speech that is motivated mostly by syntactic properties rather than, for example, lexical or morphological ones (e.g. word-formation paradigms).

2015-04-28 20:04:27 -0400 answered a question Corpus analysis software program

See the answer here:

AntConc, for example, should be useful, but general Unix command-line tools also allow you to process or analyze corpora in other languages, assuming that by "other languages" you mean other scripts or text encodings (e.g. some Unicode standard). Otherwise, you can always code simple tools in Python or Java yourself. You might want to look into NLTK for Python, which can process various file formats and Unicode-encoded texts.


2015-04-28 12:44:58 -0400 answered a question Straightforward software for forming a corpus of on-line texts

Hi Annette,

If you want to build your own corpus from web content, you first have to make sure that this is legal in your country and that the owners allow it. Once you are certain that the web content you want to use can legally be copied to your computer and used for research or education, you could use a tool based on wget to fetch it via URL. wget is pretty powerful and can fetch an entire site with many sub-pages. You should be able to find plenty of tutorials online on how to use wget, either on the command line or through a graphical frontend (for Mac or Windows). Use your favorite search engine to figure out where to get wget for your operating system (it is often already installed on Linux; on a Mac you usually have to install it yourself).
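To make the wget step a bit more concrete, here is a minimal sketch in Python that only assembles a recursive wget call and hands it to the system. It assumes wget is installed and on your PATH. The function names, the output folder `corpus_raw`, and the default depth and wait values are my own illustrative choices, not anything prescribed by wget.

```python
import subprocess


def build_wget_command(url, out_dir="corpus_raw", depth=2, wait_seconds=1):
    """Assemble a recursive wget call as a list of arguments.

    The defaults (folder name, depth, wait time) are illustrative assumptions.
    """
    return [
        "wget",
        "--recursive",                    # follow links to sub-pages
        f"--level={depth}",               # limit how deep the recursion goes
        "--no-parent",                    # never climb above the start URL
        f"--wait={wait_seconds}",         # pause between requests, be polite
        f"--directory-prefix={out_dir}",  # store everything under this folder
        url,
    ]


def mirror_site(url, **options):
    """Run wget; raises CalledProcessError if wget reports a failure."""
    subprocess.run(build_wget_command(url, **options), check=True)
```

Calling `mirror_site("https://example.org/texts/", depth=1)` would then fetch that page and the pages it links to directly, where the URL is just a placeholder. Only do this for sites whose owners permit it.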

One way to convert HTML pages (common web pages) to a corpus format would be to transfer the page encoding to some XML language, for example the XML of the Text Encoding Initiative (TEI). You can use OxGarage for that: it can convert HTML, OpenOffice/LibreOffice or Word documents to TEI XML. This would allow you to handle meta-information and additional markup, e.g. linguistic annotations, either as inline or standoff annotations. There used to be a plugin for OpenOffice/LibreOffice that allowed you to open HTML, Word, ODF or other texts and export them directly to TEI XML.

You can also use simple HTML-to-text converters to create raw text files in some Unicode encoding, e.g. UTF-8. Simple tools like unhtml, which can be found online for various operating systems, remove all HTML tags from the files and leave just the raw text content for further analysis. Alternatively, HTML files can be opened with common text processing software (OpenOffice/LibreOffice, MS Word, etc.) and exported to raw text.
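If you would rather script the tag stripping yourself, the following is a rough sketch using only Python's standard `html.parser` module. It is not how unhtml works internally; the class name, the function name, and the list of block-level tags are my own assumptions, and real-world pages may need a more robust library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of script/style."""

    SKIP = {"script", "style"}
    # Tags after which words should not run together (illustrative list).
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

To get UTF-8 text files, you would read each downloaded page, pass its content to `html_to_text`, and write the result with `open(path, "w", encoding="utf-8")`.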

For simple tools to extract collocations, frequency profiles, and concordances, Laurence Anthony has a software collection here, e.g. AntConc and other tools. You will find other tools online if you search for concordance and corpus analysis tools.
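Just to illustrate what such tools compute, here is a small sketch in plain Python of a frequency profile, a keyword-in-context concordance, and raw bigram counts as a crude first step towards collocations. This is not how AntConc is implemented; the function names and the very naive lowercase word tokenizer are assumptions for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_profile(tokens, n=10):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def bigram_counts(tokens):
    """Count adjacent word pairs, a starting point for collocations."""
    return Counter(zip(tokens, tokens[1:]))
```

For real collocation work you would normally rank the pairs with an association measure rather than raw counts, which the dedicated tools do for you.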

If you want to process the corpus linguistically, that is, tokenize the text, break it into sentences, generate lemmata or part-of-speech tags for the tokens, or even parse the texts syntactically, you should look at the Stanford CoreNLP tools, the GATE text mining tools, etc.

Best wishes


2015-04-09 13:17:20 -0400 asked a question Audio equipment recommendations on a budget

On the EMELD pages there are some recommendations for audio equipment. The content is obviously outdated. Does anybody have some suggestions for the selection of recorder and microphone for speech recordings, fieldwork etc.? If you have some ideas or recommendations, let us update this page:

Thanks a lot!

2015-03-20 00:02:54 -0400 answered a question Which conferences or documentaries would you recommend?

The Linguistic Society of America (LSA) has a nice guide here:

or more of that here:

You have probably already found the YouTube Virtual Linguistics Campus and many other such sites.

2015-03-19 16:40:57 -0400 asked a question Island theory and discontinuous islands in Minimalist Program

What has replaced Island Theory in MP?

How does MP deal with the Cavar & Fanselow or Fanselow & Cavar examples of discontinuous islands:


Ivan je ušao u bijelu kuću.
I. be enter in white house
"Ivan went into the white house."

No extraction from PP (u bijelu kuću):

* Ivan je ušao kuću u bijelu _
I be enter house in white

* Ivan je kuću ušao u bijelu _
I be house enter in white

* kuću je Ivan ušao u bijelu _
house be I enter in white

But split of PP is possible:

U bijelu je Ivan kuću ušao.
in white be I. house enter



2015-03-19 15:53:45 -0400 edited answer What language is this?

I believe it's Tibetan, written in a particular cursive (dbu med "headless") style, and says "om mani padme hum". Compare the second Tibetan line here:

-- Norvin Richards

I'm hoping one of my colleagues will have an answer. I am stumped. My best guess is that it could be a Middle Eastern script or an Indic one (i.e. South Asian or Southeast Asian). I have a suspicion that it is using vowel marks... but it's not a guarantee.

One place to look at different scripts is Omniglot. However, this tattoo may be in a very calligraphic style.

I do have to add that non-Western script tattoos are notorious for being inaccurate. There are blogs devoted to mistranslated tattoos.

If you are interested in getting a tattoo, I would recommend looking at information about non-Western calligraphic styles and seeing what designs are available. The translations are likely to be more accurate.

Ideally, you would want to check with a native speaker also, or at least Wikipedia.

Sincerely hope this helps.

Elizabeth J Pyatt

Hi, Divya,

Well, I tried various ways of clicking on the URL, and they all came back with 'Page not found'. Either Dr. Pyatt had better luck than I did, or the site does not accept such consultations from 'furriners'.


James L. Fidelholtz, Graduate Program in Language Sciences, Instituto de Ciencias Sociales y Humanidades, Benemérita Universidad Autónoma de Puebla, MÉXICO

