Lexical Profiles According to the CEFR: What Does Research Say?

23 June, 2022


One of our most widely used tools, the Text Inspector Scorecard, uses widely respected metrics to describe the level of a text according to the CEFR (the Common European Framework of Reference). The CEFR is the most widely used reference point in the world for language levels, especially in academic contexts.

To find out exactly how useful these metrics are to language learners, the British Council, the Open University and Oxford University carried out in-depth research.

One of the core researchers, Dr Nathaniel Owen (EdD), Senior Research and Validation Manager at Oxford University, has summarised the findings in the blog post below.

Keep reading to find out more about lexical profiles, how and why we develop lexical profiles of learner writing, and the role lexis plays in language learning.

This blog post is developed from a report produced for the British Council by Dr Nathaniel Owen, Dr Prithvi Shrestha and Professor Stephen Bax. See the full report here.

What are Lexical Profiles? 

Lexical profiles are descriptions of specific levels of language users, expressed in terms of the kinds of language they use. 

To create lexical profiles, we need a framework describing levels of language proficiency, for example the levels of the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2001, 2018).

Lexical profiles can be expressed in terms of key metrics such as those produced by Text Inspector, at each of the levels (A1 – C2).

How can we develop lexical profiles of CEFR levels for student writing?

We can create lexical profiles using large amounts of data taken from learner writing or speaking activities. 

Although Text Inspector can also be used to analyse texts written by learners and texts for learners to read or listen to, this blog post focuses solely on written texts. 

It provides an overview of the efforts to develop lexical profiles of student writing across multiple levels of the CEFR by using the metrics utilised in the Text Inspector language analysis tool. 

So, how do we develop the lexical profiles? 

We start by taking a large number of samples of student writing, separating them by level (e.g. A1, A2, B1) and then analysing them using Text Inspector.

We can then perform statistical analysis to see which Text Inspector metrics are the most successful at identifying key differences across the levels.  
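That last step can be sketched in a few lines of Python. This is an illustration only, with invented numbers rather than data from the study: group any Text Inspector metric by CEFR band and check whether its mean rises steadily across bands.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (level, metric value) pairs, one per writing sample.
# The metric here could be, say, type count; the values are invented.
samples = [
    ("A1", 42), ("A1", 38), ("A2", 55), ("A2", 61),
    ("B1", 78), ("B1", 84), ("B2", 95), ("B2", 101),
]

# Group the metric by CEFR band and compute the mean per band.
by_level = defaultdict(list)
for level, value in samples:
    by_level[level].append(value)

means = {level: mean(values) for level, values in sorted(by_level.items())}
print(means)  # a metric is promising if the means rise steadily across bands
```

The study itself used more rigorous statistical tests than a comparison of means, but the underlying idea is the same: a useful metric is one whose values separate cleanly across CEFR bands.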


The importance of lexis in learner writing

Text Inspector provides many metrics of lexis, such as those based on the English Vocabulary Profile (EVP), the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).

Knowledge and use of lexis is a particularly important part of successful language learning. Studies show a clear link between vocabulary knowledge and writing proficiency.

Why develop lexical profiles of learner writing?  

We wanted to find out which Text Inspector metrics were most sensitive to changes in learner writing proficiency across CEFR bands.

These can be used to develop lexical profiles for learners at different levels, helping both students and teachers. 

This is important for the following reasons:

  • Students can see how they are progressing and whether there are specific areas of writing they need to develop in order to progress to the next CEFR band. 
  • Test developers can use lexical profiles of student writing as part of a validation argument to show how their tests elicit language at different levels of proficiency. 
  • Test developers and teachers can also use these metrics as part of assessor training. As a result, the people responsible for marking student writing can use this information to help inform their decisions about what scores to award specific samples of writing. 
  • More sensitive metrics can be used to refine computational models in order to improve the performance of automated scoring engines.

What we looked at 

The British Council provided 6,407 samples of learner writing from the Aptis test, representing more than one million words of learner writing and learners from 65 countries. We used the Text Inspector tool to analyse these samples. 

[Find out which metrics Text Inspector can use by visiting the Features page here]

These included things like the number of words, number of sentences, number of syllables, average sentence length and average word length. 
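For illustration only (this is not Text Inspector's own implementation), these surface metrics can be approximated with a very rough tokenisation:

```python
import re

def surface_metrics(text):
    """Basic surface metrics: word, sentence and average length counts."""
    # Rough heuristics: sentences end in ., ! or ?; words are letter runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "words": len(words),
        "sentences": len(sentences),
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

metrics = surface_metrics("I like tea. My sister prefers strong black coffee.")
print(metrics)
```

Even metrics this simple turn out to carry signal, since more proficient writers tend to produce longer texts with longer sentences and longer words.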

We also looked at vocabulary use by comparing vocabulary used by test-takers to data from the BNC, COCA, the EVP, the Academic Word List (AWL), and metadiscourse markers.

We then compared results according to the CEFR level awarded to each of the samples, using statistical analysis.


What we found

Text Inspector can detect changes in learner writing systematically as the CEFR level of learners increases. We identified twenty-six metrics which were most useful in distinguishing across CEFR boundaries, including: 

  • Measures of text length (e.g. sentence, token and type count)
  • Measures of sophistication (e.g. syllable count and number of words with more than two syllables)
  • Measures of vocabulary use (fourteen of the twenty-six metrics represented vocabulary use)

[For full details of the measures and our findings, see the report here]

The use of lexis 

As the table below shows, learners with higher proficiency in English are more likely to use words of more than two syllables than writers with lower proficiency.

It’s interesting to note that the use of these words increases significantly between B1, B2 and C levels, whereas A1-B1 bands do not see much change. 

Likewise, the number of syllables per 100 words generally increases with each CEFR band, but the differences from B1 to the C levels are larger than those from A1 to B1.

[Table: average % of words with more than two syllables and average number of syllables per 100 words, by CEFR level]
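These two sophistication measures can be sketched as follows. The syllable counter below is a crude vowel-group heuristic for illustration only; real syllable counters, including Text Inspector's, are more sophisticated.

```python
import re

def syllable_count(word):
    # Crude heuristic: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def sophistication(words):
    """% of words with more than two syllables, and syllables per 100 words."""
    syllables = [syllable_count(w) for w in words]
    return {
        "pct_words_over_two_syllables": 100 * sum(s > 2 for s in syllables) / len(words),
        "syllables_per_100_words": 100 * sum(syllables) / len(words),
    }

sample = "the government implemented an ambitious education policy".split()
result = sophistication(sample)
print(result)
```

A text dominated by short, frequent words scores low on both measures, which is why they separate the higher CEFR bands from the lower ones.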

Additionally, the study showed evidence of how lexis use changes across the CEFR bands. 

See the figure below from the English Vocabulary Profile (EVP) which shows that learners’ use of higher-level lexis increases with each CEFR band. Notice that it also shows that these changes are sometimes small and irregular.

[Table: average token % at each EVP level (A1–C2), by CEFR band]


The table shows that the proportion of basic lexis (A1) decreases with CEFR band, but by a very small amount (from 72.45 to 69.52%). 

This is because all writing requires the use of grammatical function words (e.g. and, the, in, on, so etc.), regardless of CEFR level. 

Very infrequent lexis (C1 and C2) hardly occurs at all, even among very proficient writers.

Even for C-level writers, the proportion of C1 and C2 lexis they use amounts to only 1.25% of their total output (1.02% + 0.23% from the bottom two boxes in the above table).

The biggest changes occur with A2 and B1 lexis. 

Accordingly, use of A2 lexis is most useful for discriminating between A1 and A2 learners and use of B1 lexis is most useful for discriminating between B2 and C level learners.

However, these changes are small. Given the average number of words produced by learners in the research is around 250 words, the percentage differences across CEFR levels may amount to a total of only 2-3 words in most cases. 

Lexis data is therefore useful, but cannot be used in isolation to determine CEFR level of learner writing.   

Use of Metadiscourse

We found evidence that the use of metadiscourse changes significantly across CEFR bands, and that further investigation of this area is justified. 

We found that overall use of metadiscourse markers (tokens) peaks at B1 level (21.49%) and then falls at B2 and C levels. However, we also found that the range of metadiscourse markers (types) increases with CEFR level, peaking at B2. See the table below.


[Table: metadiscourse token % and type % by CEFR band]

This means that more proficient writers use fewer metadiscourse markers overall, but use a greater variety in comparison with less proficient writers. This data suggests that the way in which learners use metadiscourse varies across CEFR level. This is something we plan to investigate further.
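One plausible way to compute token and type percentages for metadiscourse is sketched below. The marker list is a tiny invented stand-in, not Text Inspector's inventory, and for simplicity it matches single words only (real tools also handle multi-word markers such as "in addition").

```python
# Tiny invented stand-in for a real metadiscourse inventory.
MARKERS = {"however", "therefore", "furthermore", "firstly"}

def metadiscourse_profile(words, markers=MARKERS):
    """Token % = marker occurrences / all tokens;
    type % = distinct markers used / distinct words used."""
    hits = [w for w in words if w in markers]
    token_pct = 100 * len(hits) / len(words)
    type_pct = 100 * len(set(hits)) / len(set(words))
    return token_pct, type_pct

sample = "however we tried however it failed therefore we stopped".split()
token_pct, type_pct = metadiscourse_profile(sample)
```

In this toy sample the writer repeats "however", so the token percentage is high relative to the type percentage; a more varied writer would show the opposite pattern, which is exactly the token-versus-type contrast described above.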

[Text Inspector is unique in providing measures of metadiscourse use in writing samples. For more information about how Text Inspector analyses metadiscourse, visit the Metadiscourse page here]


We have shown that Text Inspector can provide valuable information about the lexis that learners use in their writing as they progress. We have also seen that the metrics are sensitive to changes in language proficiency. 

However, the research also revealed that some changes across CEFR bands are smaller than might be expected.

A limitation of this type of analysis is that there is no information on context or appropriateness of language use. Other aspects of linguistic competence are very important in judging learner writing, such as their ideas (task completion), organisation, coherence, register and tone.

We hope that this is just the start of empirical investigations into lexical use across CEFR bands.

This blog post provides a brief overview of the findings of a large-scale research study. 

For more information about this research and other research projects related to assessment, please visit the British Council website here. 

Want to find out the CEFR level of your text? 

Click here

