The tagger tool helps us analyse the grammar of any given text and label it with parts of speech. The lexical diversity tools allow us to analyse the complexity of a text and gauge a user’s skill in that language.
This makes it useful for both linguistics research and ESL teaching.
We asked him to tell us more about analysing lexical diversity, its history and, perhaps most importantly, its limitations.
Although I originally trained as a linguist and data scientist, I’ve delved more and more into computer science over the course of my academic career.
This blend of interests and skills has led me to collaborate with people in a variety of disciplines, including language acquisition studies.
Measures of linguistic diversity are of particular importance in this context because they enable researchers to assess how certain aspects of a child’s language abilities develop.
In the beginning, I worked on methods of measuring inflectional diversity (the average diversity of word forms that belong to a given lexeme or lemma), then moved on to exploring the more basic and fundamental notion of lexical diversity.
Lexical diversity measures how much the vocabulary of a language sample (or text) varies.
This can help us evaluate the complexity of the linguistic data or systems, which is why it’s most frequently used in studies on first and second language learning, as well as in quantitative approaches to language typology.
All ways of measuring lexical diversity in a sample (or text) are based on the sample’s lexical variety, that is, the number of distinct words it contains.
Clearly, a sample’s variety is strongly dependent on its length, if only because it cannot contain more distinct words (types) than the total words it contains (tokens) and more generally because words are more often repeated in longer texts.
This means that we cannot directly compare samples of different lengths.
Several decades ago, we learned that the so-called “type-token ratio” is not a reliable way to cope with samples of different sizes because the relationship between the number of types and the number of tokens is not linear.
Type-token ratio (TTR) is calculated by dividing the number of distinct (unique) words, or types, in a text by the total number of words, or tokens.
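As a quick illustration, here is a minimal sketch of the TTR calculation in Python (the whitespace tokenisation here is a simplification for the example):

```python
def ttr(tokens):
    """Type-token ratio: number of distinct words (types)
    divided by the total number of words (tokens)."""
    return len(set(tokens)) / len(tokens)

# "the" appears twice, so there are 5 types among 6 tokens.
print(ttr("the cat sat on the mat".split()))  # ≈ 0.833
```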
It can still be useful in certain circumstances: when comparing texts of the same or very similar size, when a larger text nonetheless has a higher TTR than a smaller one (since length normally works against it), or as part of the calculation of more sophisticated measures such as VOCD or MTLD.
Formal accounts of lexical diversity seem to have first emerged between the 1930s and 1950s, with the rise of information theory and the first computers. Early contributions to this line of research came from various disciplines, including statistics, linguistics, biology and psychology and are associated with the names of such scholars as George Kingsley Zipf, George Udny Yule, Irving John Good, Gustav Herdan, and Pierre Guiraud.
There are many methods! But if you need to compare samples of different sizes, which is often the case, you should turn to methods that rely on some form of resampling procedure.
These have been shown to be more robust to variations in sample or text size.
There are three such methods that I know of: VOCD, MTLD, and resampled variety. They are all more or less sophisticated derivatives of an older index known as the ‘mean segmental type-token ratio’ (MSTTR).
(This is a calculation that divides the text or sample into equal segments, then calculates the average type-token ratio across all segments.)
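That procedure can be sketched as follows; the segment size of 50 tokens is a common but arbitrary choice made here, and the trailing partial segment is discarded, as many implementations do:

```python
def msttr(tokens, segment_size=50):
    """Mean segmental TTR: cut the text into consecutive equal-sized
    segments and average the type-token ratio across them."""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    return sum(len(set(s)) / len(s) for s in segments) / len(segments)
```

Because every segment has the same length, the averaged TTRs are comparable with one another, which is what makes MSTTR more robust to text length than the plain TTR.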
VOCD and MTLD are two methods for measuring lexical diversity that attempt to reduce the impact of sample-size variations.
Text Inspector makes use of both VOCD and MTLD; there is a further discussion of the merits of both on our website.
VOCD is the most complex of the three because it makes certain assumptions about the relation between the number of types and tokens in a sample and involves a curve-fitting process.
This is a statistical calculation where a curve or mathematical function is constructed to best fit the data points gathered.*
The VOCD method takes small subsamples at random (usually around 50 words) and calculates the average type-token ratio. The calculation can be carried out many times and then averaged.
[*Here’s a more technical explanation: The VOCD algorithm first computes the average TTR in subsamples of increasing size (35, 36, …, 50 tokens) to obtain a so-called ‘empirical’ TTR curve. Then it seeks the theoretical curve that matches the empirical curve most closely, among a family of possible TTR curves generated by varying a single numeric parameter. The corresponding parameter value is the result of the VOCD measurement.]
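A simplified sketch of that procedure in Python (this is not the reference implementation: the number of draws per size, the grid search over the parameter D, and its bounds are all illustrative assumptions):

```python
import math
import random

def vocd(tokens, trials=100, seed=0):
    """Sketch of the VOCD procedure: build an empirical TTR curve from
    random subsamples of 35..50 tokens, then find the parameter D whose
    theoretical curve TTR(N) = (D/N) * (sqrt(1 + 2N/D) - 1) fits it best."""
    rng = random.Random(seed)
    sizes = range(35, 51)
    empirical = []
    for n in sizes:
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        empirical.append(sum(ttrs) / trials)

    def model(d, n):
        return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

    # Simple grid search over D instead of a proper curve-fitting routine.
    best_d, best_err = None, float("inf")
    for d in (x / 100 for x in range(1, 20001)):  # D in 0.01 .. 200.00
        err = sum((model(d, n) - e) ** 2 for n, e in zip(sizes, empirical))
        if err < best_err:
            best_d, best_err = d, err
    return best_d
```

A higher D means the TTR curve decays more slowly as subsample size grows, i.e. a more diverse vocabulary.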
MTLD is the most recent approach. It’s unique because it considers the original order of words in the sample.
The MTLD method is also based on the type-token ratio (TTR) of a text: it calculates the average length of segments that maintain a certain TTR. The calculation is carried out twice, once from left to right and once in reverse, and the two results are averaged.
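Here is a minimal sketch of that bidirectional procedure, using the conventional TTR threshold of 0.72. The weighting of the final partial segment follows a commonly described convention, and the sketch assumes the text contains enough repetition to complete at least one full segment:

```python
def mtld_pass(tokens, threshold=0.72):
    """One directional MTLD pass: count the 'factors', i.e. the
    stretches of text whose running TTR stays above the threshold."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:  # factor complete: reset
            factors += 1
            types, count = set(), 0
    if count:  # leftover partial factor, weighted by how far TTR has dropped
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors

def mtld(tokens, threshold=0.72):
    """Average of a left-to-right and a right-to-left pass."""
    return (mtld_pass(tokens, threshold) + mtld_pass(tokens[::-1], threshold)) / 2
```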
I would also like to mention resampled variety, which is basically the expected variety in subsamples of a fixed size (e.g. 100 words) drawn at random from the text under consideration.
This enters the computation of VOCD, but it doesn’t rely on the same assumptions nor does it use a curve-fitting procedure. It has the merit of simplicity and produces a result that is easy to interpret since it is actually a number of types.
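Under that definition, resampled variety can be sketched as a Monte Carlo estimate (the subsample size, the number of draws, and sampling without replacement are assumptions of this sketch; the expectation can also be computed exactly in closed form):

```python
import random

def resampled_variety(tokens, subsample_size=100, trials=1000, seed=0):
    """Expected number of distinct types in random subsamples of a
    fixed size, estimated by averaging over many random draws."""
    rng = random.Random(seed)
    return sum(len(set(rng.sample(tokens, subsample_size)))
               for _ in range(trials)) / trials
```

The result is directly interpretable as a number of distinct words, which is the simplicity advantage mentioned above.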
The biggest factor to be aware of is that both VOCD and MTLD increase with the degree of lexical diversity in the sample: a higher score means a more varied vocabulary.
The unit in which the result of VOCD is expressed does not have a natural interpretation.
Also, it traditionally involves a random sampling procedure, which is why its results tend to vary slightly from one execution to the next.
Perhaps surprisingly, the result of MTLD is an average length. This is the average number of words that must be read in sequence before their type-token ratio falls below some predefined threshold (0.72).
Thus MTLD takes a fixed type-token ratio and looks for the corresponding segment length, which in a sense is the exact reverse of a measure like resampled variety, which takes a fixed length and computes the corresponding type-token ratio.
In general, users should be aware that no measure of lexical diversity is truly, strictly independent of sample size.
While we should be using measures that attempt to reduce the impact of sample size variation, they do still have limitations in their ability to deal with large sample size discrepancies.
Also, if one of the samples being compared is actually an aggregation of several texts, it will tend to score higher on lexical diversity than a sample consisting of a single text.
I’m not aware of any measure that takes this aspect into consideration.
Over the past ten years, I’ve been working on a major project to develop software that enables non-specialist users to build text analysis workflows adapted to their needs without having to learn a programming language.
The result is an open-source tool called Textable, which I encourage everyone interested in text analysis to download and try.
Besides this methodological work, I also use computerized text analysis to study mostly born-digital linguistic and cultural data. This includes text extracted from video games and computer-mediated person-to-person communication.