
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
	<head>
		<title>What This Module Does</title>
<link href="../docs-assets/Breadcrumbs.css" rel="stylesheet" rev="stylesheet" type="text/css">
		<meta name="viewport" content="width=device-width, initial-scale=1">
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
		<meta http-equiv="Content-Language" content="en-gb">

<link href="../docs-assets/Contents.css" rel="stylesheet" rev="stylesheet" type="text/css">
<link href="../docs-assets/Progress.css" rel="stylesheet" rev="stylesheet" type="text/css">
<link href="../docs-assets/Navigation.css" rel="stylesheet" rev="stylesheet" type="text/css">
<link href="../docs-assets/Fonts.css" rel="stylesheet" rev="stylesheet" type="text/css">
<link href="../docs-assets/Base.css" rel="stylesheet" rev="stylesheet" type="text/css">
<link href="../docs-assets/Colours.css" rel="stylesheet" rev="stylesheet" type="text/css">

	</head>
	<body class="commentary-font">
		<nav role="navigation">
		<h1><a href="../index.html">
<img src="../docs-assets/Inform.png" height="72">
</a></h1>
<ul><li><a href="../index.html">home</a></li>
</ul><h2>Compiler</h2><ul>
<li><a href="../structure.html">structure</a></li>
<li><a href="../inbuildn.html">inbuild</a></li>
<li><a href="../inform7n.html">inform7</a></li>
<li><a href="../intern.html">inter</a></li>
<li><a href="../services.html">services</a></li>
<li><a href="../secrets.html">secrets</a></li>
</ul><h2>Other Tools</h2><ul>
<li><a href="../inblorbn.html">inblorb</a></li>
<li><a href="../indocn.html">indoc</a></li>
<li><a href="../inform6.html">inform6</a></li>
<li><a href="../inpolicyn.html">inpolicy</a></li>
<li><a href="../inrtpsn.html">inrtps</a></li>
</ul><h2>Resources</h2><ul>
<li><a href="../extensions.html">extensions</a></li>
<li><a href="../kits.html">kits</a></li>
</ul><h2>Repository</h2><ul>
<li><a href="https://github.com/ganelson/inform"><img src="../docs-assets/github.png" height=18> github</a></li>
</ul><h2>Related Projects</h2><ul>
<li><a href="../../../inweb/index.html">inweb</a></li>
<li><a href="../../../intest/index.html">intest</a></li>

</ul>
		</nav>
		<main role="main">
		<!--Weave of 'What This Module Does' generated by Inweb-->
<div class="breadcrumbs">
    <ul class="crumbs"><li><a href="../index.html">Home</a></li><li><a href="../services.html">Services</a></li><li><a href="index.html">lexicon</a></li><li><a href="index.html#P">Preliminaries</a></li><li><b>What This Module Does</b></li></ul></div>
<p class="purpose">An overview of the lexicon module's role and abilities.</p>

<ul class="toc"><li><a href="P-wtmd.html#SP1">&#167;1. Prerequisites</a></li><li><a href="P-wtmd.html#SP2">&#167;2. A symbols table for natural language</a></li><li><a href="P-wtmd.html#SP4">&#167;4. Optimisations</a></li><li><a href="P-wtmd.html#SP6">&#167;6. Performance in practice</a></li></ul><hr class="tocbar">

<p class="commentary firstcommentary"><a id="SP1" class="paragraph-anchor"></a><b>&#167;1. Prerequisites. </b>The lexicon module is a part of the Inform compiler toolset. It is
presented as a literate program or "web". Before diving in:
</p>

<ul class="items"><li>(a) It helps to have some experience of reading webs: see <a href="../../../inweb/index.html" class="internal">inweb</a> for more.
</li><li>(b) The module is written in C, in fact ANSI C99, but this is disguised by the
fact that it uses some extension syntaxes provided by the <a href="../../../inweb/index.html" class="internal">inweb</a> literate
programming tool, making it a dialect of C called InC. See <a href="../../../inweb/index.html" class="internal">inweb</a> for
full details, but essentially: it's C without predeclarations or header files,
and where functions have names like <span class="extract"><span class="extract-syntax">Tags::add_by_name</span></span> rather than <span class="extract"><span class="extract-syntax">add_by_name</span></span>.
</li><li>(c) This module uses other modules drawn from the compiler (see <a href="../structure.html" class="internal">structure</a>), and also
uses a module of utility functions called <a href="../../../inweb/foundation-module/index.html" class="internal">foundation</a>.
For more, see <a href="../../../inweb/foundation-module/P-abgtf.html" class="internal">A Brief Guide to Foundation (in foundation)</a>.
</li></ul>
<p class="commentary firstcommentary"><a id="SP2" class="paragraph-anchor"></a><b>&#167;2. A symbols table for natural language. </b>This module provides an analogue to the "symbols table" used in a compiler for
a conventional language. For example, in a C compiler, identifiers such as
<span class="extract"><span class="extract-syntax">int</span></span>, <span class="extract"><span class="extract-syntax">x</span></span> or <span class="extract"><span class="extract-syntax">printf</span></span> might all be entries in such a table, and any new name
can rapidly be checked to see if it matches one already known.
</p>

<p class="commentary">In natural language we have "excerpts", that is, contiguous runs of words,
rather than identifiers. But we must similarly remember their meanings.
Examples might include:
</p>

<blockquote>
    <p>american dialect, say close bracket, player's command, open, Hall of Mirrors</p>
</blockquote>

<p class="commentary">Conventional symbols table algorithms depend on the fact that identifiers are
relatively long sequences of letters (often 8 or more units) drawn from a
small alphabet (say, 37 characters: the letters, the digits and the underscore). But Inform
has short symbols (typically 1 to 3 units) drawn from a huge alphabet (say,
5,000 different words found in the source text). Inform also allows for
flexibility in matching: the excerpt meaning <span class="extract"><span class="extract-syntax">give # bone</span></span>, for example, must
match "give a dog a bone" or "give me the thigh bone".
</p>
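
<p class="commentary">To make the flexible matching concrete, here is a minimal sketch in plain C.
It is not the module's actual matcher: the function name is hypothetical, and it
assumes that <span class="extract"><span class="extract-syntax">#</span></span> stands for a run of one or more words, which is
consistent with the two examples above but is not confirmed by them.
</p>

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch, not the lexicon module's code: match a tokenised
   excerpt against a pattern in which "#" is assumed to stand for a run
   of one or more words. */
bool pattern_matches(const char **pat, int np, const char **words, int nw) {
    if (np == 0) return nw == 0;   /* pattern used up: the text must be too */
    if (strcmp(pat[0], "#") == 0) {
        /* let the wildcard absorb 1, 2, ... words, then try the rest */
        for (int k = 1; k <= nw; k++)
            if (pattern_matches(pat + 1, np - 1, words + k, nw - k))
                return true;
        return false;
    }
    if (nw == 0 || strcmp(pat[0], words[0]) != 0) return false;
    return pattern_matches(pat + 1, np - 1, words + 1, nw - 1);
}
```

<p class="commentary">Under that assumption, the pattern <span class="extract"><span class="extract-syntax">give # bone</span></span> accepts both of the
five-word excerpts quoted above, and rejects the bare "give bone", since the
wildcard is left with nothing to absorb.
</p>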

<p class="commentary">We also need to parse in ways which a conventional compiler does not. If C has
registered the identifier <span class="extract"><span class="extract-syntax">pink_martini</span></span>, it never needs to notice <span class="extract"><span class="extract-syntax">martini</span></span> as
being related to it. But when Inform registers "pink martini" as the name of an
instance, it then has to spot that either "pink" or "martini" alone might also
refer to the same object.
</p>

<p class="commentary">Finally, we have to cope with ambiguities. An innocent word like "door" might
have multiple meanings, and the more so once multi-word flexible patterns
are involved.
</p>

<p class="commentary firstcommentary"><a id="SP3" class="paragraph-anchor"></a><b>&#167;3.  </b>This is not a large module, but it contains tricky and speed-critical code.
In compensation, it exposes a very simple API to the outside world, all of
which is found in <a href="1-lxc.html" class="internal">Lexicon (in lexicon)</a>.
</p>

<p class="commentary">The lexicon is stored using <a href="2-em.html#SP1" class="internal">excerpt_meaning</a> objects, defined in <a href="2-em.html" class="internal">Excerpt Meanings</a>.
Entries are added with <a href="1-lxc.html#SP1" class="internal">Lexicon::register</a> and retrieved with <a href="1-lxc.html#SP3" class="internal">Lexicon::retrieve</a>.
</p>

<p class="commentary">In either case the user must supply a "meaning code", such as <span class="extract"><span class="extract-syntax">TABLE_MC</span></span>, giving
a very loose idea of the context; we will use that to make lookups faster,
to provide separate namespaces (one can search for just <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> meanings,
for example), and to control the style of parsing done.
See <a href="P-htitm.html" class="internal">How To Include This Module (in lexicon)</a>.
</p>
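
<p class="commentary">The namespace role of meaning codes can be illustrated with a toy table.
This is a sketch only: <span class="extract"><span class="extract-syntax">toy_register</span></span> and <span class="extract"><span class="extract-syntax">toy_retrieve</span></span> are hypothetical
stand-ins and do not reproduce the signatures of <span class="extract"><span class="extract-syntax">Lexicon::register</span></span> or
<span class="extract"><span class="extract-syntax">Lexicon::retrieve</span></span>; the bit values and the second code <span class="extract"><span class="extract-syntax">OTHER_MC</span></span> are
invented, and the linear search ignores the optimisations described below.
</p>

```c
#include <stddef.h>
#include <string.h>

/* Illustrative meaning codes, one bit each so that they can be OR-ed
   into a mask. The values, and OTHER_MC itself, are invented here. */
#define TABLE_MC 0x01u
#define OTHER_MC 0x02u

typedef struct toy_meaning {
    unsigned int mc;      /* exactly one meaning-code bit */
    const char *excerpt;  /* the words, held as a single string */
    const char *data;     /* what the excerpt means */
} toy_meaning;

#define MAX_TOY_MEANINGS 64
static toy_meaning toy_lexicon[MAX_TOY_MEANINGS];
static int toy_count = 0;

/* Add an entry; returns 0 if the toy table is full, 1 otherwise. */
int toy_register(unsigned int mc, const char *excerpt, const char *data) {
    if (toy_count >= MAX_TOY_MEANINGS) return 0;
    toy_lexicon[toy_count++] = (toy_meaning) { mc, excerpt, data };
    return 1;
}

/* Find the first entry with these words whose code lies within mc_mask;
   returns NULL if there is none. */
const char *toy_retrieve(unsigned int mc_mask, const char *excerpt) {
    for (int i = 0; i < toy_count; i++)
        if ((toy_lexicon[i].mc & mc_mask) &&
            strcmp(toy_lexicon[i].excerpt, excerpt) == 0)
            return toy_lexicon[i].data;
    return NULL;
}
```

<p class="commentary">Registering "table of tides" under <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> and then retrieving it with
a mask of <span class="extract"><span class="extract-syntax">OTHER_MC</span></span> alone finds nothing, which is the sense in which each
meaning code acts as a separate namespace.
</p>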

<p class="commentary firstcommentary"><a id="SP4" class="paragraph-anchor"></a><b>&#167;4. Optimisations. </b>This is a speed-critical part of Inform and has been heavily optimised, at the
cost of some complexity. There are two main ideas:
</p>

<p class="commentary">Firstly, each word in the vocabulary gathered up by the <a href="../words-module/index.html" class="internal">words</a> module &mdash;
i.e., each different word in the source text &mdash; has a <a href="2-em.html#SP3" class="internal">vocabulary_lexicon_data</a>
object attached to it. This in turn contains lists of all known meanings
starting with, ending with, or simply involving the word.
</p>

<p class="commentary">For example, if "great green dragon" is given a meaning, then this is added to
the first-word list for "great", the last-word list for "dragon", and the
middle-word list for "green".
</p>

<p class="commentary">In addition, every word in an excerpt which is not an article adds the meaning
to its "subset list" &mdash; here, that would be all three words, but for "gandalf
the grey", it would be entered onto the subset lists for "gandalf" and "grey".
Subset lists tend to be longer and thus slower to deal with, and are used only
in contexts where it is legal to use a subset of a name to refer to the
meaning &mdash; for example, to say just "Gandalf" but mean the same wizard.
</p>
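
<p class="commentary">As a rough illustration, the rule just described can be written as a function
saying which lists a given word of an excerpt would join. This follows the prose
above literally and is not taken from the module: the names are hypothetical,
the set of articles is an assumption, and treating a one-word excerpt as having
only a first word is a choice made for the sketch.
</p>

```c
#include <stdbool.h>
#include <string.h>

/* One flag per kind of list attached to a vocabulary word. */
enum { START_LIST = 1, END_LIST = 2, MIDDLE_LIST = 4, SUBSET_LIST = 8 };

/* Articles are skipped for subset lists; this short set is an assumption. */
static bool is_article(const char *w) {
    return strcmp(w, "a") == 0 || strcmp(w, "an") == 0 || strcmp(w, "the") == 0;
}

/* Which lists belonging to word number i (counting from 0) would an
   n-word excerpt be added to? Returns a combination of the flags above. */
int lists_joined(const char **words, int n, int i) {
    int lists = 0;
    if (i == 0) lists |= START_LIST;           /* first word */
    else if (i == n - 1) lists |= END_LIST;    /* last word */
    else lists |= MIDDLE_LIST;                 /* anything in between */
    if (!is_article(words[i])) lists |= SUBSET_LIST;
    return lists;
}
```

<p class="commentary">For "great green dragon" every word gains a subset-list entry in addition to
its positional one, whereas in "gandalf the grey" the word "the" gains only a
middle-list entry.
</p>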

<p class="commentary firstcommentary"><a id="SP5" class="paragraph-anchor"></a><b>&#167;5.  </b>Secondly, recall that each vocabulary entry has a field 32 bits wide for
a bitmap, of which only 6 bits were used in the lexer. (See <a href="../words-module/2-vcb.html" class="internal">Vocabulary (in words)</a>.)
For example, cardinal numbers had the <span class="extract"><span class="extract-syntax">NUMBER_MC</span></span> bit set.
</p>

<p class="commentary">We're now going to use the other 26 bits. The idea is that if a meaning is
registered for the name of, say, a table, then the <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> bit would be
set for each of the words in that name. For example, if "table of tides" is
such a name, then each of the words <span class="extract"><span class="extract-syntax">table</span></span>, <span class="extract"><span class="extract-syntax">of</span></span> and <span class="extract"><span class="extract-syntax">tides</span></span> picks up the
<span class="extract"><span class="extract-syntax">TABLE_MC</span></span> bit.
</p>

<p class="commentary">What we gain by this is that if we are ever testing some words in the source
text to see if they might be the name of a table, we can immediately reject,
say, "green table" because the word <span class="extract"><span class="extract-syntax">green</span></span> does not have the <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> bit.
</p>
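
<p class="commentary">The quick rejection test might be sketched as follows. The structure and
function names are hypothetical, and the bit position chosen for <span class="extract"><span class="extract-syntax">TABLE_MC</span></span>
is invented; only the idea of setting one bit on every word of a registered name
comes from the description above.
</p>

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_MC (UINT32_C(1) << 6)   /* bit position invented for this sketch */

typedef struct toy_word {
    const char *text;
    uint32_t flags;   /* the 32-bit bitmap of meaning-code bits */
} toy_word;

/* Registering a name sets the meaning-code bit on every word in it. */
void mark_name(toy_word **name, int n, uint32_t mc) {
    for (int i = 0; i < n; i++) name[i]->flags |= mc;
}

/* Cheap pre-test: if any word lacks the bit, no meaning with code mc can
   match. If every word has it, a full lookup is still required. */
bool might_have_meaning(toy_word **excerpt, int n, uint32_t mc) {
    for (int i = 0; i < n; i++)
        if ((excerpt[i]->flags & mc) == 0) return false;
    return true;
}
```

<p class="commentary">Note that the test is necessary but not sufficient: once "table of tides" has
been marked, "tides table" also passes it, even though no such table exists, so
a pass only means that the slower matching process is worth attempting.
</p>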

<p class="commentary">For more on this, and for complications arising to do with case sensitivity,
see <a href="2-em.html#SP8_1" class="internal">ExcerptMeanings::hash_code_from_token_list</a>.
</p>

<p class="commentary firstcommentary"><a id="SP6" class="paragraph-anchor"></a><b>&#167;6. Performance in practice. </b>The following statistics show how many times the lexicon was used during
a typical Inform 7 compilation (the same one used to generate the data in
<a href="../inform7/M-pm.html" class="internal">Performance Metrics (in inform7)</a>).
</p>

<p class="commentary">Optimisation is worthwhile if:
</p>

<ul class="items"><li>&#9679; the number of attempts with incorrect hash codes is appreciably larger
than the number with correct ones.
</li></ul>
<p class="commentary">Optimisation is efficient if:
</p>

<ul class="items"><li>&#9679; the number of attempts with correct hash codes is close to the
number of successes.
</li></ul>
<pre class="undisplayed-code all-displayed-code code-font">
<span class="plain-syntax">Size of lexicon: 3128 excerpt meanings</span>
<span class="plain-syntax">  Stored among 845 words out of total vocabulary of 10734</span>
<span class="plain-syntax">  715 words have a start list: longest belongs to report (with 293 meanings)</span>
<span class="plain-syntax">  15 words have an end list: longest belongs to case (with 6 meanings)</span>
<span class="plain-syntax">  29 words have a middle list: longest belongs to to (with 4 meanings)</span>
<span class="plain-syntax">  108 words have a subset list: longest belongs to street (with 4 meanings)</span>

<span class="plain-syntax">Number of attempts to retrieve: 110342</span>
<span class="plain-syntax">  of which unsuccessful: 92204</span>
<span class="plain-syntax">  of which successful: 18138</span>

<span class="plain-syntax">Total attempts to match against excerpt meanings: 276470</span>
<span class="plain-syntax">  of which, total with incorrect hash codes: 253734</span>
<span class="plain-syntax">  of which, total with correct hash codes: 22736</span>
<span class="plain-syntax">  of which, total which matched: 19905</span>
</pre>
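
<p class="commentary">The two criteria above can be checked against these figures with a little
arithmetic. The helper names below are hypothetical and exist only to show the
calculation; the inputs are the counts reported in the table.
</p>

```c
/* Ratio of hash-code mismatches to hash-code matches: the larger this is,
   the more work the cheap hash test is saving. */
double rejection_ratio(long incorrect_hash, long correct_hash) {
    return (double) incorrect_hash / (double) correct_hash;
}

/* Fraction of attempts with correct hash codes which went on to match:
   the closer to 1, the fewer full comparisons are wasted. */
double hit_rate(long matched, long correct_hash) {
    return (double) matched / (double) correct_hash;
}
```

<p class="commentary">With 253734 incorrect and 22736 correct hash codes, the rejection ratio is
roughly 11 to 1, so the optimisation is worthwhile; and with 19905 matches out
of those 22736 attempts, the hit rate is close to 0.88, so it is also
reasonably efficient.
</p>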
<nav role="progress"><div class="progresscontainer">
    <ul class="progressbar"><li class="progressprevoff">&#10094;</li><li class="progresscurrentchapter">P</li><li class="progresscurrent">wtmd</li><li class="progresssection"><a href="P-htitm.html">htitm</a></li><li class="progresschapter"><a href="1-lm.html">1</a></li><li class="progresschapter"><a href="2-em.html">2</a></li><li class="progressnext"><a href="P-htitm.html">&#10095;</a></li></ul></div>
</nav><!--End of weave-->

		</main>
	</body>
</html>