Here chargram designates sequences of n characters within word-forms. Please note the following:
- The chargrams database is derived from all word-forms occurring 10 or more times in the
BNC data as normalized on
this site, almost 120,000 types, and includes over 540,000 chargrams.
- To avoid counting word-forms more than once, initial position overrides
final position. For example, the word "the" counts only for
initial position of the 3-chargram "the" (it could conceivably also be
construed as occurring finally). The 3-chargram "the"
occurs finally in "bathe" and medially in "other"; both are tallied.
- Since there are many more tokens than types, a higher minimum frequency
may be desirable to limit dataset size. (For types, each
instance of a chargram in a word-form counts only once. For tokens,
these frequencies are multiplied by the total number of times the word-form occurs
in the corpus.)
- Select "n = 1-8" to match a range of chargram sizes. If you specify a filter, wildcard characters
are added to ensure matches across the range.
Tip: to limit matches to 2 or more characters, add the appropriate number of ? to the filter field.
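
The position and counting rules in the notes above can be illustrated with a short Python sketch. This is not the site's actual code: the function names chargram_positions and tally are hypothetical, the word frequencies are invented for the example, and the sketch assumes that each instance of a chargram adds 1 to the type count and the word-form's corpus frequency to the token count.

```python
from collections import Counter


def chargram_positions(word, n):
    """Yield (chargram, position) pairs for one word-form.

    Position is "initial", "medial", or "final". A chargram that spans
    the whole word is both initial and final; initial position wins,
    so the word-form is counted only once.
    """
    last = len(word) - n  # start index of the last possible chargram
    for i in range(last + 1):  # empty when the word is shorter than n
        if i == 0:
            position = "initial"
        elif i == last:
            position = "final"
        else:
            position = "medial"
        yield word[i:i + n], position


def tally(word_freqs, n):
    """Return (type_counts, token_counts) keyed by (chargram, position).

    Assumption: every instance adds 1 to the type count and the
    word-form's frequency to the token count.
    """
    types = Counter()
    tokens = Counter()
    for word, freq in word_freqs.items():
        for key in chargram_positions(word, n):
            types[key] += 1
            tokens[key] += freq
    return types, tokens


# Invented frequencies, for illustration only (not BNC figures).
word_freqs = {"the": 100, "bathe": 2, "other": 10}
types, tokens = tally(word_freqs, 3)

for position in ("initial", "medial", "final"):
    key = ("the", position)
    print(position, types[key], tokens[key])
```

With these invented frequencies, "the" is tallied once per position by type, while the token counts differ widely because the word "the" itself dominates. This is why a higher minimum frequency may be useful when working with tokens.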