How "Phrases in English" Was Made

Modification and Normalization of the BNC Data

To interpret query results correctly, users of this site should understand the normalization conventions followed in compiling the PIE database. Generally speaking, these conventions were adopted to limit the number of non-essential distinctions in the data, both to permit linguistic patterns to emerge more clearly and to improve database performance. Because of these conventions and the exclusion of items falling below the frequency threshold of three, queries against this database can yield different frequency counts from queries made directly against the BNC database.

  1. Upper-case characters were converted to lower case. Because proper nouns have different POS tags from common nouns, Pole, for example, can still be distinguished from pole, but proper adjectives are not marked.
  2. Accented characters (café, façade) were mapped onto their "plain" equivalents (cafe, facade), both because source texts are not always consistent or correct in their use of diacritics and because entering plain-character queries is easier from English-language keyboards. Other SGML character entities were treated variously, as detailed here.
  3. All numerals of any magnitude and degree of precision were mapped onto a single #, so both 31,298,435 and 0.0095 appear as #, and ranges of numerals like 1931-35 appear as #-#. The primary motivation was to highlight lexical patterns involving numbers and dates which would otherwise be obscured by the large number of variants. About 1 in 50 "words" in the BNC is a sequence of numerals, and the data contain about 45,000 different sequences, of which 60% occur just once; fewer than half of them meet the frequency cutoff of three or more occurrences for inclusion in the PIE database. In the follow-on phase an additional database is planned specifically to study numerals and numbers in the BNC data.
  4. Each token identified by the CLAWS parser with a lexical or morphemic POS tag (i.e. not a punctuation tag) was treated as a "word". "Multiword units" (such as in spite of) are joined by underscores, not spaces (in_spite_of). "Fused" forms, both contractions like isn't and possessives like boy's, are separated into their components: is n't, boy 's. Somewhat inconsistently, can't was de-fused into can n't, but won't and ain't were separated into wo n't and ai n't. Following the BNC's usage, variant spellings without or with a space are treated as one or two words respectively, and hyphenated variants are treated as a single distinct word: data base (two words); database, data-base (distinct one-word types).
  5. Punctuation marks other than quotation marks and commas were treated as end-of-phrase markers, as were segment boundaries assigned by the CLAWS parser; word-external punctuation was stripped.
  6. All sequences of 1-8 "words" and POS tags in each phrase (see 5) were isolated and tallied to construct the database. Words and n-grams occurring fewer than three times in the entire corpus were dropped. Phrase-frames were then derived from this database and tallied, and all phrase-frames with two or more variants were retained.
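
The conventions in steps 1-3 and 6 can be illustrated with a minimal Python sketch. This is not the PowerBasic code used to build the database. The function names, the numeral pattern, and the definition of a phrase-frame as an n-gram of two or more words with one slot replaced by * are assumptions made for illustration, and the sketch tallies words only, not POS tags.

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Assumed numeral pattern: digits with optional comma or point groups.
NUMERAL = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(token):
    """Steps 1-3: lower-case, strip diacritics, map numerals onto '#'."""
    token = token.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return NUMERAL.sub("#", token)

def ngrams(phrase, max_n=8):
    """Yield every contiguous sequence of 1..max_n words in one phrase."""
    for n in range(1, max_n + 1):
        for i in range(len(phrase) - n + 1):
            yield tuple(phrase[i:i + n])

def tally(phrases, min_freq=3, max_n=8):
    """Step 6: count n-grams over all phrases and drop the rare ones."""
    counts = Counter()
    for phrase in phrases:
        counts.update(ngrams(phrase, max_n))
    return Counter({g: c for g, c in counts.items() if c >= min_freq})

def phrase_frames(counts, min_variants=2):
    """Group retained n-grams that differ in exactly one slot.

    Assumption: a frame is an n-gram (n >= 2) with one word replaced by
    '*'; the variants are the words attested in that slot.
    """
    frames = defaultdict(dict)
    for gram, freq in counts.items():
        if len(gram) < 2:
            continue
        for i, word in enumerate(gram):
            frame = gram[:i] + ("*",) + gram[i + 1:]
            frames[frame][word] = freq
    return {f: v for f, v in frames.items() if len(v) >= min_variants}

# Toy data: each inner list is one already-tokenized phrase.
phrases = (
    [["in", "the", "end"]] * 3
    + [["in", "the", "beginning"]] * 3
    + [["at", "the", "end"]] * 2
)
counts = tally(phrases)
frames = phrase_frames(counts)
```

With this toy input, normalize("1931-35") gives #-#, the n-gram at the is dropped because it occurs only twice, and the frame in the * is retained with the variants end and beginning.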

The Words and Phrases database uses the MySQL database server with a Web user interface programmed in PHP. The text normalization procedures described above are programmed in PowerBasic, incorporating routines coded by the developer for KWiCFinder and kfNgram; the latter generates phrase-frames and chargrams as well as n-grams.