Getting Started with "Phrases in English"

Before you start...

Please take a moment to read the FAQ to understand what this site means by n-grams and phrase-frames, and how the BNC defines words and tags them grammatically with POS codes. Then familiarize yourself with the normalization conventions which are specific to this database. Finally, please remember that this site contains only a subset of the words and phrases in the BNC, those occurring three times or more.

The drop-down menu under "Grams" affords access to all the query interfaces.

The following discussion shows screen shots from the "Explore N-Grams" page. Important differences for "Explore Phrase-Frames" are outlined at the end.

Clicking on one of the "Explore..." links launches a frameset with a query pane on the left and a blank pane for query results on the right. If either pane is too narrow, a scrollbar appears at the bottom of the window. To change the relative sizes, click on the divider, hold down the mouse button, and drag the divider to resize the panes.

While numerous [Query] buttons are strewn around the page for convenience, they are actually unnecessary: just hit the "Enter" key to submit a query.

Selecting the value of n


Select the number of words in n-grams or phrase-frames by clicking on the radio button to the left of the number. Try each number to develop a feeling for the kinds of datasets which match each value of n. A value of 1 returns a list of individual words. The highest-frequency two-word phrases tend to be fragmentary building blocks of language, including "de-fused" contractions like do n't, I 'm; other useful groups of 2-grams are compound nouns written as two words as well as adjective-noun and adverb-adjective collocations. Recurring 3-grams and 4-grams are often (almost) complete familiar phrases.

As n increases, the number of distinct n-grams in the corpus grows, while the percentage of n-grams meeting the cutoff criteria (here: minimum of 2 occurrences in the corpus) declines. For n values of 5 and greater, formulaic expressions restricted to highly specific circumstances become increasingly prominent in the data, and the total frequency of any given 5- or 6-gram is relatively small. For example, there are almost 1.6 M 3-grams occurring 5 or more times, and over 1.3 M 2-grams; in contrast, fewer than half as many 4-grams as 3-grams cross the threshold, and the number of 6-grams that qualify is less than 4% of the figure for 3-grams. Values of n greater than 6 almost exclusively reflect formulaic language and quotations.
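To make the relationship between n, the number of distinct n-grams, and the frequency cutoff more concrete, here is a minimal Python sketch. It is not the code behind this site; the token list and the helper names count_ngrams and apply_cutoff are invented for illustration, and the cutoff of 2 is simply passed in as a parameter.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def apply_cutoff(counts, min_freq):
    """Keep only the n-grams that occur at least min_freq times."""
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

# Toy token list (hypothetical); contractions are already "de-fused".
tokens = "i do n't know and i do n't care but i do know".split()

bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)

frequent_bigrams = apply_cutoff(bigrams, 2)
frequent_trigrams = apply_cutoff(trigrams, 2)

print(len(bigrams), "distinct 2-grams,", len(frequent_bigrams), "pass the cutoff")
print(len(trigrams), "distinct 3-grams,", len(frequent_trigrams), "pass the cutoff")
```

Even in this tiny sample the pattern described above shows up: moving from 2-grams to 3-grams yields more distinct items, but a smaller share of them reaches the minimum frequency.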

Note the links to jump down to the word-form and POS-code filters, as well as the ubiquitous [Query] button. 

Display options and numeric query conditions


The options in this section display "tool tips", i.e. short explanations that pop up when you hover your mouse cursor over them. Several display options appear in the right-hand column.  As a general rule, for efficient querying you should limit the display options to what you actually need.

Numeric Conditions

Order specifies (you guessed it) the order in which items are displayed: from most to least frequent or vice versa, in alphabetical order of the items, or in alphabetical order of the POS tags. If you choose an order other than alphabetical, items with identical values for the primary sort key appear in alphabetical order of the word or phrase. Alphabetical sorts are the most efficient option.
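The tie-breaking rule can be pictured as a two-part sort key. The following Python sketch uses made-up rows and frequencies and only illustrates the ordering described here; it says nothing about how the database itself sorts.

```python
# Hypothetical result rows: (phrase, frequency).
rows = [("of the", 120), ("in the", 120), ("at the", 95), ("to be", 120)]

# Most to least frequent; items with the same frequency fall back
# to alphabetical order of the phrase.
by_frequency = sorted(rows, key=lambda row: (-row[1], row[0]))

for phrase, freq in by_frequency:
    print(freq, phrase)
```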

Focusing your search with filters


Without filters, every item that meets your numeric conditions is included in the results dataset, which degrades database response time. Filters narrow the dataset down to items which match specific criteria. You can match word-forms, POS tags, or both; if both criteria are specified for a given position, items must match all criteria (logical AND) to pass the filter.

To match alternate word-forms or POS tags, enter them all in the same field separated by spaces (logical OR).

Unfortunately there are no semantic filters.  For example, by specifying both word-form and POS tag you can distinguish the verb match from the noun, but you cannot distinguish instances in which the latter means 'sports contest' from 'incendiary device', 'corresponding entity' etc.

Both word-form and POS filters support wildcards: * matches any number of characters; ? matches one character, no more, no less.
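If you want to experiment with wildcard patterns offline, Python's standard fnmatch module uses the same two symbols. This is only an approximation of the site's behaviour: in fnmatch, * also matches zero characters and square brackets have a special meaning, neither of which is confirmed for this database. The word list is invented.

```python
from fnmatch import fnmatchcase

# Hypothetical lower-case word-forms.
forms = ["country", "countries", "countryside", "count", "cat", "cut", "coat"]

# * stands for any run of characters.
star_matches = [f for f in forms if fnmatchcase(f, "countr*")]

# ? stands for exactly one character.
question_matches = [f for f in forms if fnmatchcase(f, "c?t")]

print(star_matches)
print(question_matches)
```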

Check the exclude box to eliminate the forms or tags you specify from the dataset: then only items which do not match your specifications are retrieved.
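The way these rules combine can be sketched in a few lines of Python. The sketch rests on several assumptions: the helper names field_passes and position_passes are invented, each field is assumed to have its own exclude box, an empty field is assumed to place no restriction, and the sample tags are BNC C5 codes (NN1, NN2, VVB) chosen purely for illustration.

```python
from fnmatch import fnmatchcase

def field_passes(value, field, exclude=False):
    """Check one filter field against one value.

    Space-separated alternatives are combined with logical OR.
    An empty field places no restriction (assumption).
    With exclude=True the result is inverted.
    """
    alternatives = field.split()
    if not alternatives:
        return True
    hit = any(fnmatchcase(value, alt) for alt in alternatives)
    return not hit if exclude else hit

def position_passes(word, tag, word_field="", tag_field="",
                    exclude_word=False, exclude_tag=False):
    """Word-form and POS criteria for one position are combined with logical AND."""
    return (field_passes(word, word_field, exclude_word)
            and field_passes(tag, tag_field, exclude_tag))

# Hypothetical (word-form, POS tag) items.
items = [("match", "NN1"), ("match", "VVB"), ("matches", "NN2"), ("game", "NN1")]

# The noun "match" in singular or plural: word OR-list AND tag wildcard.
nouns = [i for i in items if position_passes(*i, "match matches", "NN*")]

# The same word-forms, but with the noun tags excluded.
non_nouns = [i for i in items
             if position_passes(*i, "match matches", "NN*", exclude_tag=True)]

print(nouns)
print(non_nouns)
```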

If there is an entry in a filter field, the field is colored light green, or light red if the exclude box is checked. These color codes make it easy to spot which fields have entries and whether they specify inclusion or exclusion.

Note the < Top  link to jump back to the top, e.g. to change the value of n.  Decreasing the value of n hides some of the filter fields, but the values are preserved and reappear when the higher value of n is restored. Careful: clicking the Clear Filters button erases everything directly, without confirmation.  It is a good idea to click this button when starting a new query lest you unintentionally carry over filters from a previous search.

Matching word forms


The Words & Phrases Database normalization process converts all characters to lower case, so matching is not case sensitive: Pole and pole both return the same dataset. To match more than one specific word-form, it is generally most efficient to specify all forms, separating them with spaces: country countries will match only these two forms, while the wildcard countr* matches additional forms such as countrified, countryman, and countryside, resulting in a larger dataset with unnecessary items. You can increase the number of useful matches by specifying orthographic variants such as realise realize and database data-base.

The corpus preserves the spelling of original texts, so compound forms might be written together as a single word, with a hyphen, or as two separate words. Consequently you may need to run separate queries to cover up to three possibilities:

Query 1: Word 1 = database data-base
Query 2: Word 1 = data, Word 2 = base

or

Query 1: Word 1 = much-needed
Query 2: Word 1 = much, Word 2 = needed

In the normalization process sentence punctuation was removed. Three kinds of punctuation remain within word-forms:

Using wildcards you can find forms with these punctuation marks:  *_*  matches all multiword forms with underscore, and *-*  returns hyphenated forms.  Forms with apostrophes are "de-fused" into their components; to reconstruct them look for 2-grams and specify *'* for word 2. 
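The three patterns just mentioned can be tried out on a small sample. In the Python sketch below, the word-forms and 2-grams are invented, and fnmatch only approximates the site's wildcard matching; the last step shows one way to glue "de-fused" contractions back together once the matching 2-grams have been retrieved.

```python
from fnmatch import fnmatchcase

# Hypothetical 1-gram forms and 2-grams, already lower-cased.
forms = ["as_well_as", "much-needed", "database", "in_spite_of"]
bigrams = [("do", "n't"), ("i", "'m"), ("of", "the"), ("she", "'s")]

# Multiword forms joined with underscores.
multiword = [f for f in forms if fnmatchcase(f, "*_*")]

# Hyphenated forms.
hyphenated = [f for f in forms if fnmatchcase(f, "*-*")]

# 2-grams whose second word contains an apostrophe, rejoined into one form.
rejoined = [first + second for first, second in bigrams
            if fnmatchcase(second, "*'*")]

print(multiword)
print(hyphenated)
print(rejoined)
```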

Matching POS codes


The CLAWS parser assigns each "word" in the corpus a Part Of Speech (POS) code as detailed here. These codes may be specified in various ways:

Tip: to get a feeling for different ways to specify POS tags, select various entries from the drop-down list and study the codes that appear in the POS field.

Differences between n-gram and phrase-frame queries

There are several subtle differences between the "Explore N-Grams" and the "Explore Phrase-Frames" interfaces:

  1. Since phrase-frames must consist of more than one lexical unit, the smallest meaningful value of n is 2.
  2. There is one additional pseudo-POS-tag which can be matched, the "wildword" code -*- . Entering it forces the wildword to appear in the corresponding position (i.e. as the first, second etc. word).  Careful: at most one wildword may be specified per query, and each query must have at least one position for which both the word-form and POS-tag are unspecified. 
  3. Phrase-frame query results show only the phrase-frames. To see the actual variants of a phrase-frame, click on it in the results pane; the variants will appear in a new window. (If you have a pop-up window blocker, disable it for this site!)
  4. Additional numeric conditions and ordering options for phrase-frames are highlighted in green in the screenshot below. Reminder: here as elsewhere, references to counts and frequencies imply "for the numeric conditions and word and POS filters specified".
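For readers who find it easier to think in code, the relationship between n-grams, phrase-frames, and variants can be sketched as follows. This assumes that a phrase-frame is an n-gram with exactly one position replaced by a wildword (see the FAQ for the definition this site actually uses); the function name phrase_frames, the "*" placeholder, and the counts are all invented for illustration.

```python
from collections import defaultdict

WILDWORD = "*"  # placeholder for the variable position (illustrative only)

def phrase_frames(ngram_counts):
    """Group n-grams into frames that differ in exactly one position.

    Returns a mapping: frame -> {variant word: frequency of the full n-gram}.
    """
    frames = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        for pos in range(len(gram)):
            frame = gram[:pos] + (WILDWORD,) + gram[pos + 1:]
            frames[frame][gram[pos]] = freq
    return frames

# Hypothetical 3-gram counts.
counts = {
    ("at", "the", "end"): 40,
    ("at", "the", "time"): 35,
    ("at", "the", "moment"): 30,
    ("in", "the", "end"): 25,
}

frames = phrase_frames(counts)

# Variants of the frame "at the *" and of the frame "* the end".
print(frames[("at", "the", WILDWORD)])
print(frames[(WILDWORD, "the", "end")])
```

Because the wildword can sit in any position, the same n-gram contributes to several frames, which is why a position filter such as -*- is useful for narrowing results.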

 

Result pages

All kinds of query results appear in a "pane" next to the query form and have several clickable buttons at the top:

N-Grams

Click on any word or phrase to display a random set of 50 concordances of the item from the corpus in a separate window.
 

Phrase-Frames

Click on any phrase-frame to display variants in a separate window. The frequency cutoffs, first item and chunk size are determined by the parameters specified in the phrase-frame query.

 

Phrase-Frame Variants

Click on any word or phrase to display a random set of 50 concordances of the item from the corpus in a separate window.