Corpus Linguistics Glossary
Terms and Definitions
Alias: A user-designated synonym for a Unix command or sequence of commands. For example, if you designated m to be your alias for mailx, then typing m will always run this mail program.
Alignment: The matching or linking of a text and its translation(s), usually paragraph by paragraph and/or sentence by sentence. Texts are often aligned in this way so that bilingual CONCORDANCES can be retrieved. Some alignment can be done automatically by software, although the best results are usually produced when a human user checks the automatic alignment and corrects it where necessary.
Alphanumeric: Of ASCII characters, any string composed of only upper- or lower-case English letters or Arabic numerals.
AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models)
Anaphora: Pronouns, noun phrases, etc. which refer to something already mentioned in a text; sometimes the term is used more loosely (and, technically, incorrectly) to refer in general to items which co-refer, even when they do not occur in the text itself (exophora) or when they refer forwards rather than backwards in the text (cataphora).
Annotation: (1) The practice of adding explicit additional information to the machine-readable text; (2) The physical representation of such information.
ARCHER: A Representative Corpus of Historical English Registers
ASCII: The American Standard Code for Information Interchange is a standard character set that maps character codes 0 through 127 (low ASCII) onto control functions, punctuation marks, digits, upper case letters, and other symbols.
Attribute: In SGML, a qualifier within the opening tag for an element which specifies a value for some named property of that element.
Authenticity: a feature that characterizes naturally occurring corpus data
BFT (Binary File Transfer): A way of sending files by ftp. The files are sent in binary code, not translated into ASCII, which would risk some information loss.
CALL: computer-aided (or assisted) language learning
CAMET: Computer Archive of Modern English Texts, a project of Geoffrey Leech of the Department of Language and Modern English in 1970.
Character encoding: a system of using numeric values to represent characters
COCOA (word COunt and COncordance generation on Atlas): A method of text encoding used by the Oxford Concordance Program and other software.
Colligation: the collocation of a node word with a particular grammatical class of words
Collocation: the characteristic co-occurrence of patterns of words
Comparable corpus: a corpus which is composed of L1 data collected from different languages using the same sampling techniques
Comparative corpus: a corpus containing components of varieties of the same language
Concordance: an alphabetical index of a search pattern in a corpus, showing every contextual occurrence of the search pattern
Corpus balance: the range of different types of language that a corpus claims to cover
Corpus header: the part of a corpus that provides necessary bibliographical information, taxonomies used and other metadata relating to a corpus
Corpuses: a less commonly used plural form of corpus
Cross-tabulation: a table showing the frequencies for each variable across each sample
Co-text: a more precise term than context or verbal context, used to refer to the words on either side of a selected word or phrase.
Dispersion: a term in descriptive statistics which refers to a quantifiable variation of measurements of differing members of a population within the scale on which they are measured
Ditto tag: in corpus annotation, the practice of assigning the same part-of-speech code to each word in an idiomatic expression
DTD: Document Type Definition, used in markup languages such as HTML, SGML and XML
Error-tagging: assigning codes indicating the types of errors occurring in a learner corpus
Factor analysis: a statistical analysis commonly used in the social and behavioural sciences to summarize the interrelationships among a large group of variables in a concise fashion
Fisher's exact test: an alternative to the chi-square or log-likelihood test that measures the exact statistical significance level
Frequency: also called raw frequency, the actual count of a linguistic feature in a corpus
Interlanguage: the learner’s knowledge of the L2 which is independent of both the L1 and the actual L2
Keyword: a word in a corpus whose frequency is unusually high (positive keyword) or low (negative keyword) in comparison with a reference corpus
KWIC: key-word-in-context concordance
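A minimal sketch of how a KWIC concordance can be built from a list of tokens. The function name kwic and its span parameter are illustrative assumptions, not taken from any particular concordancer; matching here is a simple case-insensitive comparison on whitespace-separated tokens.

```python
def kwic(tokens, node, span=3):
    """Return one (left co-text, node, right co-text) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            # Take up to `span` tokens on each side of the node word.
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines


text = "the cat sat on the mat and the dog sat on the rug"
for left, word, right in kwic(text.split(), "sat", span=3):
    # Right-align the left co-text so the node words line up in a column.
    print(f"{left:>20}  {word}  {right}")
```

Sorting the returned triples on the left or right co-text gives the sorted concordance views that most packages offer.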
Lemmatization: grouping together all of the different inflected forms of the same word
Lexicon: an inventory of word forms in a given language
Log-likelihood test: also known as an LL test, an alternative to the chi-square test
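As an illustration, the sketch below computes a log-likelihood score for one word's frequency in two corpora, using the two-term form often applied in corpus comparison: LL = 2 * sum(O * ln(O / E)), where E is the frequency expected if the word were spread evenly across both corpora. The function and parameter names are assumptions made for this example.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood score for a word seen freq1 times in a corpus of
    size1 tokens and freq2 times in a corpus of size2 tokens."""
    total = freq1 + freq2
    # Expected frequencies if the word were evenly distributed.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Same relative frequency in both corpora gives a score of zero.
print(log_likelihood(10, 1000, 20, 2000))
# A clear difference in relative frequency gives a larger score.
print(round(log_likelihood(30, 1000, 10, 1000), 2))
```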
Markup: a system of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing or other processing
Mean: the arithmetic average, which can be calculated by adding all of the scores together and then dividing the sum by the number of scores
Merger: combination of two or more words (e.g. can’t and gonna)
Metadata: a term used to describe data about data, typically the contextual information of corpus samples
MI: mutual information, a statistical formula borrowed from information theory
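A minimal sketch of one common way to compute an MI score for a node word and a collocate: the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected if the two words were independent. The function name and parameters are hypothetical, and this simple form ignores the span-size adjustment that some packages apply.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI = log2(observed / expected), where expected is the number of
    co-occurrences predicted if the two words were independent."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts: the pair co-occurs 8 times, the node word occurs
# 16 times and the collocate 32 times in a corpus of 1,024 tokens.
print(mutual_information(8, 16, 32, 1024))  # 4.0
```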
Microconcord: a concordance package published by Oxford University Press
Monitor corpus: a corpus that is constantly supplemented with fresh material and keeps increasing in size
Normalization: a process which makes frequencies from samples of markedly different sizes comparable by bringing them to a common base
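For example, raw frequencies can be brought to a common base such as occurrences per million words. The sketch below assumes a hypothetical helper name, normalize, and made-up counts.

```python
def normalize(raw_freq, corpus_size, base=1_000_000):
    """Convert a raw frequency to a frequency per `base` tokens."""
    return raw_freq * base / corpus_size


# 50 hits in a 200,000-word sample and 120 hits in a 1.5-million-word
# sample, both expressed per million words.
print(normalize(50, 200_000))     # 250.0
print(normalize(120, 1_500_000))  # 80.0
```

Although the second sample has the higher raw count, the feature is relatively more frequent in the first one.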
Parallel corpus: a corpus which is composed of source texts and their translations in one or more different languages; also known as a translation corpus
Parsing: also called treebanking or bracketing, a process that analyzes the sentences in a corpus into their constituents
Population: the entire set of items from which samples can be drawn
POS: part-of-speech
Post-editing: human correction of automatically processed data
Range: the difference between the highest and lowest frequencies
Reference corpus: a balanced representative corpus for general usage; in keyword analysis, a corpus that is used to provide a reference wordlist
Representativeness: a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety
Recoverability: a term used to refer to the possibility for the user to recover the basic original text from any text which has been annotated with further information.
RP: Received Pronunciation, the notional standard form of spoken British English
Sample: elements that are selected intentionally as a representation of the population being studied
Sample corpus: as opposed to a monitor corpus, a sample corpus is of finite size and consists of text segments selected to provide a static snapshot of language
Semantic prosody: the collocational meaning arising from the interaction between a given node word and its collocates
SEU: Survey of English Usage
Skeleton parsing: also called shallow parsing, a parsing technique that uses less fine-grained constituent types than would be present in a full parse
Sort: arrange concordances or a wordlist in a certain order
Span: a term used to refer to the measurement, in words, of the co-text of a word selected for study.
Specialized corpus: a corpus that is domain or genre specific and is designed to represent a sublanguage
SPSS: Statistical Package for the Social Sciences
Standardized type-token ratio: similar to type-token ratio, but computed every n (e.g. 1,000) words as the WordSmith Wordlist goes through each text file
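A rough sketch of the idea: compute the type-token ratio for each consecutive chunk of n tokens and average the results. The function name is hypothetical, and dropping a trailing chunk shorter than n is an assumption of this sketch rather than a documented description of how WordSmith behaves.

```python
def standardized_ttr(tokens, chunk_size=1000):
    """Mean type-token ratio over consecutive chunks of `chunk_size` tokens."""
    ratios = []
    # Step through full chunks only; a shorter final remainder is ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


# Tiny illustration with a chunk size of 4: the first chunk has 4 types
# (ratio 1.0), the second has 1 type (ratio 0.25), and "x" is dropped.
tokens = "a b c d a a a a x".split()
print(standardized_ttr(tokens, chunk_size=4))  # 0.625
```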
Sub corpus: a component of a corpus, usually defined using certain criteria such as text types and domains
Tagging: an alternative term for annotation, especially word-level annotation such as POS tagging and semantic tagging
Tagset: a collection of tags in the form of a scheme for annotating corpora.
Text chunking: the practice of dividing sentences into non-overlapping segments on the basis of fairly superficial analysis
Token: an occurrence of any given word form
Tokenization: also called segmentation, a process that divides running text into legitimate word tokens, especially important for languages such as Chinese that do not delimit words with white spaces
Transcription: converting spoken data into a written form
Treebank: an alternative term for a parsed corpus
T-test: an alternative statistical test to the chi-square test
Type: a word form
Type-token ratio: the ratio between types and tokens, useful when comparing samples of roughly equal length
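As a small illustration of types, tokens and their ratio, the sketch below counts them in a short string. Splitting on whitespace and lower-casing are simplifying assumptions; a real study would use a proper tokenizer.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of running words."""
    return len(set(tokens)) / len(tokens)


tokens = "The cat sat on the mat".lower().split()
print(len(tokens))                         # 6 tokens
print(len(set(tokens)))                    # 5 types ("the" occurs twice)
print(round(type_token_ratio(tokens), 3))  # 0.833
```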
Unicode: a character encoding system designed to support the interchange, processing, and display of all of the written texts of the diverse languages of the world
Wildcard: a special character such as an asterisk (*) or a question mark (?) that can be used to represent one or more characters in pattern matching
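A brief sketch of wildcard matching over a wordlist, using Python's standard fnmatch module. Wildcard conventions vary between tools: in fnmatch the asterisk matches zero or more characters and the question mark matches exactly one. The word list and the helper name match_words are made up for the example.

```python
from fnmatch import fnmatchcase


def match_words(words, pattern):
    """Return the words that match a wildcard pattern (case-sensitive)."""
    return [w for w in words if fnmatchcase(w, pattern)]


words = ["walk", "walks", "walked", "talk", "wall"]
print(match_words(words, "walk*"))  # ['walk', 'walks', 'walked']
print(match_words(words, "wal?"))   # ['walk', 'wall']
```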
Wordlist: a list of words occurring in a corpus, possibly with frequency information
WordSmith: a corpus exploration package with sophisticated statistical analysis, published by Oxford University Press
Z-test: an alternative statistical test to the chi-square test
References
Baker, Paul, Andrew Hardie & Tony McEnery. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press, 2006.
Olohan, Maeve. Introducing Corpora in Translation Studies. New York: Routledge, 2004.
Wang, Kefei. Research and Application of Bilingual Parallel Corpora. Beijing: Foreign Language Teaching and Research Press, 2004.