"Phrases in English"

"Phrases in English" FAQ

What are...?

n-grams: on this site n-grams means sequences of n words as defined here. In this database, n can be any number in the range 1-8, i.e. from individual words up to eight-word phrases. Only words and phrases occurring at least three times in the BNC are included here. Relatively frequent n-grams are typically familiar building blocks of English; such recurrent n-grams are also known as lexical bundles, lexical chains or clusters. Shorthand forms like 1-gram, 2-gram, 3-gram etc. specify the value of n; some prefer unigram, bigram, trigram etc. In information retrieval and computational linguistics contexts, the term n-gram more frequently means "sequence of n characters". On this site that sense is dubbed chargrams.
phrase-frames: sets of phrases (n-grams) which are identical except for one word, dubbed the "wildword" and represented by the wildcard sign *. For example, at the * of is a phrase-frame with variants like at the start of, at the end of, at the heart of etc. Phrase-frames are useful tools for discovering phraseological patterns. Guidelines for choosing n-grams or phrase-frames are given in the tutorials. Parallel to 3-gram etc. this site uses 3-frame etc. as shorthand for "phrase-frame of three words", and p-frame is a handy stand-in for phrase-frame.
words: lexical units as identified by the BNC's CLAWS parser with POS tags, including "multiword units". "Fused forms" are split up into morphemes, each tagged as a separate word token. Orthographic variants of the same lexeme (database / data-base, realise / realize) appear as different lexical units. Compound nouns written with white-space instead of hyphens are separated into their components, so data base is treated as two lexical units.
multiword units: phrases that function grammatically as single words, e.g. conjunction so that or preposition in spite of, receive a single POS tag, so they are treated here as single words. To make this obvious in search results they are displayed with underscores instead of spaces: so_that, in_spite_of. To search for multiword units you must enter them in a single query field and use underscores, not spaces. Since spaces in a query field separate alternative word forms to match, the word-form filter in spite of would match in OR spite OR of.
Lists of multiword units: BNC site PIE site
fused forms: multiple morphemes written without space in English such as cannot, he'd, George's are "de-fused" by the parser into can not, he 'd, George 's. Different POS tags clarify whether 'd stands for had or would and whether 's comes from is or has, or else represents a possessive.
Lists of fused forms: BNC site PIE site
filters: query conditions which focus the matching dataset by "filtering out" unwanted items. Filtering can be done by word-forms, POS codes and / or frequency, and multiple forms can be specified to either include or exclude from the dataset.
POS-tags: "Words" in the corpus are tagged with one of 57 "Part Of Speech" codes consisting of three characters; this list of POS codes explains and gives examples of how these codes are applied. The PIE database permits searching for specific combinations of POS codes specified by either choosing from a list or entering directly; wildcards can be used to match groups of related codes. Occasionally the code UNC (unclassified) is overused, for example for the ai of ain't, which is ambiguous but could be assigned manually to the proper form of BE or HAVE.
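The definitions of n-grams and phrase-frames above can be illustrated with a short Python sketch. It is not the code behind this site: the function names, the toy corpus and the rule that a frame needs at least two variants are assumptions made for the example. Only the range of n (1-8) and the minimum frequency of 3 come from the definitions above.

```python
from collections import Counter, defaultdict

MIN_FREQ = 3  # inclusion threshold stated in the n-grams entry
MAX_N = 8     # longest n-gram stored in the database


def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens within each sentence."""
    if not 1 <= n <= MAX_N:
        raise ValueError("n must be in the range 1-8")
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def phrase_frames(ngram_counts):
    """Group n-grams that are identical except for the word in one position."""
    frames = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        for pos in range(len(gram)):
            frame = gram[:pos] + ("*",) + gram[pos + 1:]
            frames[frame][gram[pos]] = freq
    # Assumption of this sketch: a frame needs at least two variants.
    return {f: v for f, v in frames.items() if len(v) > 1}


# Toy corpus: each sentence is a list of already tokenized words.
corpus = (
    [["at", "the", "end", "of", "it"]] * 3
    + [["at", "the", "start", "of", "it"]] * 3
    + [["at", "the", "heart", "of", "it"]] * 2
)

four_grams = {g: c for g, c in count_ngrams(corpus, 4).items() if c >= MIN_FREQ}
frames = phrase_frames(four_grams)
variants = frames[("at", "the", "*", "of")]
print(sorted(variants))
```

In this toy corpus at the heart of occurs only twice, so it falls below the threshold and does not appear as a variant of the frame at the * of.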

Why...?

Why do you only support Internet Explorer?

In this initial phase the time required to develop and test for multiple browsers would detract from building the database and user interface. Webmasters report that over 85% of Website visitors use Internet Explorer (IE), and even more have access to IE on their machine. When this Website is stable and fully documented I will strive for cross-browser compatibility. Incidentally, the compact and capable Opera 7 browser supports most of the IE features on this site (and starts displaying the result much sooner than IE), and most functions also work in Netscape versions 7 and higher.

Why do I see no change in the results pane after editing the query parameters?

After changing any of the query parameters, click the "Query" button or press the "Enter" key to start a new query. (The "Next" button in the results pane continues fetching subsequent chunks of the dataset from your last query.)

Why do I only see the page heading in the results pane, but no results appear?

Depending on the total number of records that match your specifications you may have to wait up to 5 minutes for results, and your browser may even "time out" while you are waiting. Queries with no word-form or POS filters and with a low minimum frequency match the largest datasets and are thus the slowest. Some suggestions to improve performance are...

Wait up to five minutes for results display before giving up or clicking the "Query" button again; launching unnecessary or redundant queries just slows the server down.
Narrow your search with word-form and / or POS filters.
Specify a higher minimum frequency to reduce the dataset size. To study frequent phrases a cutoff frequency of 1000 (or even 100) gives much faster results.
Choose a larger "chunk" size to minimize total waiting time -- the additional time required to fetch a larger dataset is negligible, limited only by connection speed.
Specify alphabetic sort order if that works for your purposes.
Use the Opera or Mozilla browser, which display results as the browser receives them (Internet Explorer waits for all the word data to arrive before displaying it, while the others build the table incrementally and resize it as necessary).

Why does the "random concordances" function take much longer to match some phrases than others?

Ironically, the more frequent a phrase and the words in it are, the longer it takes to compile a random set of concordances. In addition, this feature takes advantage of a "fulltext" index, which improves efficiency by excluding "short" words (< 4 letters) and ones which occur in over half the sentences (e.g. a, and, the, is, are...). Finding these unindexed words takes longer than finding more salient words.

To improve speed, queries consisting mainly or entirely of such short and frequent words are run against a randomized database: the sentences are in scrambled order, but matching proceeds in the same order for each new round of concordances. (Nevertheless, the "Re-Query" link finds any additional matches of your search text.) This pseudo-random approach usually provides satisfactory results in a fraction of the time required for a truly random search.

You can tell which query method was used by the codes at the bottom right of the results page (ft 'fulltext index', rt 'randomized text', followed by the number of seconds the query took).
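The two query methods described above can be sketched roughly as follows in Python. This is a simplified illustration, not the server's actual implementation: all function names, the majority rule for choosing between "ft" and "rt", and the fixed shuffle seed are assumptions. Only the exclusion criteria (fewer than 4 letters, present in over half the sentences) and the idea of scanning a pre-scrambled sentence list in a fixed order come from the text.

```python
import random

MIN_INDEXED_LENGTH = 4  # words with fewer than 4 letters are not indexed


def indexed_words(sentences):
    """Words a fulltext index would keep: long enough and not in over half the sentences."""
    doc_freq = {}
    for tokens in sentences:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    limit = len(sentences) / 2
    return {w for w, df in doc_freq.items()
            if len(w) >= MIN_INDEXED_LENGTH and df <= limit}


def choose_method(phrase, indexed):
    """Return 'ft' if most query words are indexed, else 'rt' (hypothetical rule)."""
    words = phrase.split()
    hits = sum(1 for w in words if w in indexed)
    return "ft" if hits * 2 > len(words) else "rt"


def contains(tokens, words):
    """True if the word sequence occurs contiguously in tokens."""
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def scan_randomized(shuffled, phrase, start, want):
    """Scan a pre-shuffled sentence list from `start`; return matches and the resume point."""
    words = phrase.split()
    found = []
    pos = start
    while pos < len(shuffled) and len(found) < want:
        if contains(shuffled[pos], words):
            found.append(shuffled[pos])
        pos += 1
    return found, pos


sentences = [
    "it is at the end of the road".split(),
    "she is in the garden".split(),
    "the database is in the office".split(),
    "he sat in the garden".split(),
]
indexed = indexed_words(sentences)

# Scramble once; every later round walks the same order from where it stopped.
shuffled = sentences[:]
random.Random(0).shuffle(shuffled)

first, resume = scan_randomized(shuffled, "in the", 0, 2)
more, _ = scan_randomized(shuffled, "in the", resume, 2)
print(choose_method("in the", indexed), len(first), len(more))
```

Because the scrambled order is fixed, a follow-up round simply resumes at the saved position, so it returns further matches rather than a fresh random sample.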

Why does the "random concordances" function return some irrelevant results?

This feature is much faster and more efficient without matching by POS code. In addition, the fulltext index mentioned in the previous entry appears to be overly inclusive in its matches. Spurious matches are the cost of the greatly increased speed. Without such optimization PIE could not offer this feature: several simultaneous searches really bog the server down. Please let me know via the e-mail link at the bottom of the page if you would like the option to match POS codes for this functionality (with a significant speed penalty).

Why do results show no matches for a phrase that must be in the BNC?

This question has many possible answers:

Is your minimum frequency set too high or your maximum too low? Some phrases are less frequent than you think, and setting a maximum frequency may exclude some familiar phrases. (The minimum frequency for inclusion in the database is 3; there is no maximum.)
Are you looking for phrases that are too long? Try a smaller value for n and search for a sub-phrase: 4-, 5- and 6-grams are relatively rare.
Is your query too specific? Try using some wildcards or wildwords to match a greater number of word forms.
If you have specified POS tag filters, are they appropriate for the word forms you want? Try again with no filters or filters with wildcards. If you checked the "exclude" box, does it make sense?
If you are an American, did you use the appropriate British spelling? Orthographic variants (e.g. -ise / -ize) have not been normalized. If you wish to query for more than one variant, enter both in the "word form" filter field, separated by a space (normalise normalize), or else use a wildcard (normali?e).
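The effect of the two ways of matching spelling variants can be tried out offline with a small Python sketch. It assumes that the site's wildcards behave like shell-style patterns, where ? stands for exactly one character, and it uses the standard library's fnmatch module as a stand-in for the site's own matcher. The helper name matches_filter is made up for this example.

```python
from fnmatch import fnmatchcase


def matches_filter(word, filter_text):
    """True if word matches any space-separated pattern in the filter (OR semantics)."""
    return any(fnmatchcase(word, pattern) for pattern in filter_text.split())


words = ["normalise", "normalize", "normalised", "normal"]

by_wildcard = [w for w in words if matches_filter(w, "normali?e")]
by_list = [w for w in words if matches_filter(w, "normalise normalize")]
print(by_wildcard)
print(by_list)
```

Both filters select the same two spellings here; the wildcard form is shorter, while the explicit list avoids accidental matches on other letters in that position.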

Why are there no phrase frames matching my query even though I find several variants in the database?

Phrase frames are sets of variants which are identical except for one word, e.g. all but the second word are the same. Do the variants you have observed really differ only in the (ordinally) same word?
If you specify word form or POS tag filters, leave at least one word unspecified. (You may specify -*- to force the "wildword" to appear in a specific position, but that is redundant if the other words are specified.) If you need to specify something for each word, use the "Explore N-Grams" page instead.
If you have specified POS tag filters, are they appropriate for the word forms you want? Try again with no filters. If you checked the "exclude" box, does it make sense?
Examine and possibly lower all the frequency filters.

Why can't I filter search results by text-type, i.e. domain, genre and audience?

Search by text-type will be supported in a future release of the site, presumably by mid-2004.

Why don't I see the POS-tag for ___ in the drop-down list?

These lists are not all-inclusive -- that would limit their usefulness. Rather they offer a number of "super-categories" as examples of using wildcards and numeric ranges. Please refer to the list of POS codes for any word-classes not included in the drop-down box.

Why can't I save results pages with the "Save Page" or "Save Data" buttons?

These buttons require the ActiveX file system component and work only with the Windows version of Internet Explorer 5.x and greater. With this browser your security settings will prevent saving pages unless you have enabled ActiveX components to run either automatically or after prompting (in which case you will be nagged for permission each time). It is potentially unsafe to allow every site to run any desired components on your computer. The best solution is to add this site to the browser's "Trusted Sites" list: open the Tools menu > Internet Options... > Security tab, click on the "Trusted sites" icon, then the "Sites" button, and add this site to the list. Uncheck "Require server verification...", then click "OK". On this site ActiveX is used exclusively to save Web pages. Users with security concerns are encouraged to verify this by inspecting the JavaScript function savepage( ) in the script file BNCresults.js.

Why can't I find common phrases like of course, in spite of?

Such "multiword units" are treated by the BNC's CLAWS parser as single words. Enter them in a single word field and replace the spaces with _ (underscore): of_course. Complete list of multiword units.
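As a rough illustration of this rule, the following Python sketch turns a typed phrase into the list of entries for the word fields, joining known multiword units with underscores. The three-item set stands in for the complete list of multiword units, and the function name to_query_fields is hypothetical.

```python
# Tiny illustrative sample; the real list of multiword units is much longer.
MULTIWORD_UNITS = {"in spite of", "so that", "of course"}


def to_query_fields(phrase, multiword_units=MULTIWORD_UNITS):
    """Split a phrase into word fields, joining known multiword units with underscores."""
    words = phrase.split()
    fields = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so a three-word unit beats a shorter one.
        for length in range(len(words) - i, 1, -1):
            candidate = " ".join(words[i:i + length])
            if candidate in multiword_units:
                fields.append(candidate.replace(" ", "_"))
                i += length
                break
        else:
            # No multiword unit starts here: the word gets a field of its own.
            fields.append(words[i])
            i += 1
    return fields


print(to_query_fields("in spite of the rain"))
print(to_query_fields("of course"))
```

The first call yields three fields (in_spite_of, the, rain), which is how a phrase that looks like a 5-gram is entered as a 3-gram query.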

Why can't I find contractions like don't, they're or possessives like children's, parents'?

Such "fused forms" are treated by the BNC's CLAWS parser as separate words. Enter each part in a separate word field: do n't, they 're, children 's, parents '. Note that "altered" forms like won't, ain't are segmented as wo n't, ai n't; the exception can't is segmented can n't, parallel to cannot > can not. Complete list of fused forms.
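To prepare such queries in bulk, the splitting can be approximated with a small Python sketch. This is not the CLAWS parser and makes no claim to reproduce it: the pattern covers only the endings illustrated in this FAQ (n't, 're, 'd, 's and the bare possessive apostrophe), the function name defuse is hypothetical, and forms that no suffix rule covers must be listed by hand, as cannot is here.

```python
import re

# Forms split in a way that no suffix rule covers; extend as needed.
IRREGULAR = {"cannot": ["can", "not"]}

# Stem followed by one of the endings shown in this FAQ (an assumed, partial list).
CLITIC = re.compile(r"^(.+?)(n't|'re|'d|'s|')$")


def defuse(word):
    """Split a fused form into the parts to enter in separate word fields."""
    if word.lower() in IRREGULAR:
        return list(IRREGULAR[word.lower()])
    match = CLITIC.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]


for form in ["don't", "won't", "they're", "children's", "parents'", "cannot"]:
    print(form, "->", " ".join(defuse(form)))
```

Always check the result against the complete list of fused forms before relying on it, since the parser's actual segmentation is authoritative.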