Analysing vocabulary using the British National Corpus (BNC)
The BNC tool in Text Inspector uses the British National Corpus, a corpus of over 100 million words of British English, about which you can find more here.
Text Inspector analyses your text using the exact BNC frequency rank of each word form, rather than grouping words into word families as other tools do.
We prefer not to use word families because we doubt the notion that once a learner has learnt the base word (e.g. develop), all other words in the family (e.g. development, developmental, underdeveloped) are at the same level of difficulty. This is not, in our view, a reliable assumption.
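To illustrate the difference, here is a minimal Python sketch of ranking exact word forms by frequency. It is not Text Inspector's actual implementation: the function name `rank_by_frequency` and the toy token counts are invented purely for illustration and are not real BNC figures. The point is only that each member of a word family receives its own rank instead of inheriting the rank of the base word.

```python
from collections import Counter


def rank_by_frequency(tokens):
    """Map each exact word form to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


# Toy token list, invented for illustration only; NOT real BNC data.
toy_corpus = (
    ["the"] * 50
    + ["develop"] * 12
    + ["development"] * 20
    + ["developmental"] * 2
    + ["underdeveloped"] * 1
)

ranks = rank_by_frequency(toy_corpus)

# Each form in the "develop" family gets its own rank.
for word in ["develop", "development", "developmental", "underdeveloped"]:
    print(word, ranks[word])
```

With a word-family approach, all four forms would share a single level. With exact-form ranking, a rarer derivative such as underdeveloped is treated as rarer than its base word.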
For an interesting discussion which also questions the use of word families, see:
Schmitt, N. and Zimmerman, C. (2002) ‘Derivative word forms: What do learners know?’ TESOL Quarterly 36, 2: 145-171. You can find the article here.
As these authors say:
“Some teachers and researchers may assume that when a learner knows one member of a word family (e.g., stimulate), the other members (e.g. stimulant, stimulative) are relatively easy to learn. Although knowing one member of a word family undoubtedly facilitates receptive mastery of the other members, the small amount of previous research has suggested that L2 learners often have problems producing the various derivative forms within a word family.” (Schmitt, N. and Zimmerman, C. 2002:145)
The version of the BNC used in Text Inspector is that set out in the Corpus of Contemporary American English dataset. You can find out more about it on this page:
As discussed on that same page:
“COCA and the BNC complement each other nicely, and they are are only large [sic], well-balanced corpora of English that are publicly-available. The BNC has better coverage of informal, everyday conversation, while COCA is much larger and more recent, which has important implications for the quantity and quality of the data overall.
Unless one is inherently interested in only British or American English, there is really no reason to not take advantage of both corpora.”
There are some differences:
“The BNC has a much wider range of spoken sub-genres, while COCA is composed of unscripted conversation on TV and radio shows ... Both corpora are very well balanced in terms of sub-genres for the written genres (e.g. Newspaper-Sports, or Academic-Medicine). In addition, because there is a diachronic aspect to COCA (coverage over time), in COCA the distribution of 20% in each of the five genres stays constant from year to year.”