mirror of
https://github.com/ganelson/inform.git
synced 2024-07-16 22:14:23 +03:00
208 lines
13 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head>
|
|
<title>What This Module Does</title>
|
|
<link href="../docs-assets/Breadcrumbs.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1">
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
|
<meta http-equiv="Content-Language" content="en-gb">
|
|
|
|
<link href="../docs-assets/Contents.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
<link href="../docs-assets/Progress.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
<link href="../docs-assets/Navigation.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
<link href="../docs-assets/Fonts.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
<link href="../docs-assets/Base.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
<link href="../docs-assets/Colours.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
|
|
</head>
|
|
<body class="commentary-font">
|
|
<nav role="navigation">
|
|
<h1><a href="../index.html">
|
|
<img src="../docs-assets/Inform.png" height="72">
|
|
</a></h1>
|
|
<ul><li><a href="../index.html">home</a></li>
|
|
</ul><h2>Compiler</h2><ul>
|
|
<li><a href="../structure.html">structure</a></li>
|
|
<li><a href="../inbuildn.html">inbuild</a></li>
|
|
<li><a href="../inform7n.html">inform7</a></li>
|
|
<li><a href="../intern.html">inter</a></li>
|
|
<li><a href="../services.html">services</a></li>
|
|
<li><a href="../secrets.html">secrets</a></li>
|
|
</ul><h2>Other Tools</h2><ul>
|
|
<li><a href="../inblorbn.html">inblorb</a></li>
|
|
<li><a href="../indocn.html">indoc</a></li>
|
|
<li><a href="../inform6.html">inform6</a></li>
|
|
<li><a href="../inpolicyn.html">inpolicy</a></li>
|
|
<li><a href="../inrtpsn.html">inrtps</a></li>
|
|
</ul><h2>Resources</h2><ul>
|
|
<li><a href="../extensions.html">extensions</a></li>
|
|
<li><a href="../kits.html">kits</a></li>
|
|
</ul><h2>Repository</h2><ul>
|
|
<li><a href="https://github.com/ganelson/inform"><img src="../docs-assets/github.png" height=18> github</a></li>
|
|
</ul><h2>Related Projects</h2><ul>
|
|
<li><a href="../../../inweb/index.html">inweb</a></li>
|
|
<li><a href="../../../intest/index.html">intest</a></li>
|
|
|
|
</ul>
|
|
</nav>
|
|
<main role="main">
|
|
<!--Weave of 'What This Module Does' generated by Inweb-->
|
|
<div class="breadcrumbs">
|
|
<ul class="crumbs"><li><a href="../index.html">Home</a></li><li><a href="../services.html">Services</a></li><li><a href="index.html">lexicon</a></li><li><a href="index.html#P">Preliminaries</a></li><li><b>What This Module Does</b></li></ul></div>
|
|
<p class="purpose">An overview of the lexicon module's role and abilities.</p>
|
|
|
|
<ul class="toc"><li><a href="P-wtmd.html#SP1">§1. Prerequisites</a></li><li><a href="P-wtmd.html#SP2">§2. A symbols table for natural language</a></li><li><a href="P-wtmd.html#SP4">§4. Optimisations</a></li><li><a href="P-wtmd.html#SP6">§6. Performance in practice</a></li></ul><hr class="tocbar">
|
|
|
|
<p class="commentary firstcommentary"><a id="SP1" class="paragraph-anchor"></a><b>§1. Prerequisites. </b>The lexicon module is a part of the Inform compiler toolset. It is
|
|
presented as a literate program or "web". Before diving in:
|
|
</p>
|
|
|
|
<ul class="items"><li>(a) It helps to have some experience of reading webs: see <a href="../../../inweb/index.html" class="internal">inweb</a> for more.
|
|
</li><li>(b) The module is written in C, in fact ANSI C99, but this is disguised by the
|
|
fact that it uses some extension syntaxes provided by the <a href="../../../inweb/index.html" class="internal">inweb</a> literate
|
|
programming tool, making it a dialect of C called InC. See <a href="../../../inweb/index.html" class="internal">inweb</a> for
|
|
full details, but essentially: it's C without predeclarations or header files,
|
|
and where functions have names like <span class="extract"><span class="extract-syntax">Tags::add_by_name</span></span> rather than <span class="extract"><span class="extract-syntax">add_by_name</span></span>.
|
|
</li><li>(c) This module uses other modules drawn from the compiler (see <a href="../structure.html" class="internal">structure</a>), and also
|
|
uses a module of utility functions called <a href="../../../inweb/foundation-module/index.html" class="internal">foundation</a>.
|
|
For more, see <a href="../../../inweb/foundation-module/P-abgtf.html" class="internal">A Brief Guide to Foundation (in foundation)</a>.
|
|
</li></ul>
|
|
<p class="commentary firstcommentary"><a id="SP2" class="paragraph-anchor"></a><b>§2. A symbols table for natural language. </b>This module provides an analogue to the "symbols table" used in a compiler for
|
|
a conventional language. For example, in a C compiler, identifiers such as
|
|
<span class="extract"><span class="extract-syntax">int</span></span>, <span class="extract"><span class="extract-syntax">x</span></span> or <span class="extract"><span class="extract-syntax">printf</span></span> might all be entries in such a table, and any new name
|
|
can rapidly be checked to see if it matches one already known.
|
|
</p>
|
|
|
|
<p class="commentary">In natural language we have "excerpts", that is, contiguous runs of words,
|
|
rather than identifiers. But we must similarly remember their meanings.
|
|
Examples might include:
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>american dialect, say close bracket, player's command, open, Hall of Mirrors</p>
|
|
</blockquote>
|
|
|
|
<p class="commentary">Conventional symbols table algorithms depend on the fact that identifiers are
|
|
relatively long sequences of letters (often 8 or more units) drawn from a
|
|
small alphabet (say, the 37 letters, digits and the underscore). But Inform
|
|
has short symbols (typically 1 to 3 units) drawn from a huge alphabet (say,
|
|
5,000 different words found in the source text). Inform also allows for
|
|
flexibility in matching: the excerpt meaning <span class="extract"><span class="extract-syntax">give # bone</span></span>, for example, must
|
|
match "give a dog a bone" or "give me the thigh bone".
|
|
</p>
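<p class="commentary">The flexible matching just described can be sketched as follows. This is a
minimal illustration, not the module's own code: the function name
<span class="extract"><span class="extract-syntax">excerpt_matches</span></span> is hypothetical, and it assumes that <span class="extract"><span class="extract-syntax">#</span></span> stands for
a run of one or more words, while every other token must match a word literally.
</p>

```c
#include <string.h>

/* Hypothetical sketch: returns 1 if the pattern tokens match the whole word
   sequence, 0 otherwise. A "#" token is assumed to stand for a run of one
   or more words; any other token must equal the corresponding word. */
int excerpt_matches(const char **pat, int pat_len, const char **words, int n) {
    if (pat_len == 0) return (n == 0);
    if (strcmp(pat[0], "#") == 0) {
        /* Try every non-empty run of words for the wildcard. */
        for (int used = 1; used <= n; used++)
            if (excerpt_matches(pat + 1, pat_len - 1, words + used, n - used))
                return 1;
        return 0;
    }
    if (n == 0 || strcmp(pat[0], words[0]) != 0) return 0;
    return excerpt_matches(pat + 1, pat_len - 1, words + 1, n - 1);
}
```

<p class="commentary">Under these assumptions the pattern "give", "#", "bone" accepts both "give a dog
a bone" and "give me the thigh bone", but rejects "give bone", where the wildcard
would have nothing to absorb.
</p>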
|
|
|
|
<p class="commentary">We also need to parse in ways which a conventional compiler does not. If C has
|
|
registered the identifier <span class="extract"><span class="extract-syntax">pink_martini</span></span>, it never needs to notice <span class="extract"><span class="extract-syntax">martini</span></span> as
|
|
being related to it. But when Inform registers "pink martini" as the name of an
|
|
instance, it then has to spot that either "pink" or "martini" alone might also
|
|
refer to the same object.
|
|
</p>
|
|
|
|
<p class="commentary">Finally, we have to cope with ambiguities. An innocent word like "door" might
|
|
have multiple meanings, and the more so once multi-word flexible patterns
|
|
are involved.
|
|
</p>
|
|
|
|
<p class="commentary firstcommentary"><a id="SP3" class="paragraph-anchor"></a><b>§3. </b>This is not a large module, but it contains tricky and speed-critical code.
|
|
In compensation, it exposes a very simple API to the outside world, all of
|
|
which is found in <a href="1-lxc.html" class="internal">Lexicon (in lexicon)</a>.
|
|
</p>
|
|
|
|
<p class="commentary">The lexicon is stored using <a href="2-em.html#SP1" class="internal">excerpt_meaning</a> objects, in <a href="2-em.html" class="internal">Excerpt Meanings</a>.
|
|
Entries are added with <a href="1-lxc.html#SP1" class="internal">Lexicon::register</a> and retrieved with <a href="1-lxc.html#SP3" class="internal">Lexicon::retrieve</a>.
|
|
</p>
|
|
|
|
<p class="commentary">In either case the user must supply a "meaning code", such as <span class="extract"><span class="extract-syntax">TABLE_MC</span></span>, giving
a very loose idea of the context; we will use that to make lookups faster,
to provide separate namespaces (one can search for just <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> meanings,
for example), and to control the style of parsing done.
|
|
See <a href="P-htitm.html" class="internal">How To Include This Module (in lexicon)</a>.
|
|
</p>
|
|
|
|
<p class="commentary firstcommentary"><a id="SP4" class="paragraph-anchor"></a><b>§4. Optimisations. </b>This is a speed-critical part of Inform and has been heavily optimised, at the
|
|
cost of some complexity. There are two main ideas:
|
|
</p>
|
|
|
|
<p class="commentary">Firstly, each word in the vocabulary gathered up by the <a href="../words-module/index.html" class="internal">words</a> module —
|
|
i.e., each different word in the source text — has a <a href="2-em.html#SP3" class="internal">vocabulary_lexicon_data</a>
|
|
object attached to it. This in turn contains lists of all known meanings
|
|
starting with, ending with, or simply involving the word.
|
|
</p>
|
|
|
|
<p class="commentary">For example, if "great green dragon" is given a meaning, then this is added to
|
|
the first-word list for "great", the last-word list for "dragon", and the
|
|
middle-word list for "green".
|
|
</p>
|
|
|
|
<p class="commentary">In addition, every word in an excerpt which is not an article adds the meaning
|
|
to its "subset list" — here, that would be all three words, but for "gandalf
|
|
the grey", it would be entered onto the subset lists for "gandalf" and "grey".
|
|
Subset lists tend to be longer and thus slower to deal with, and are used only
|
|
in contexts where it is legal to use a subset of a name to refer to the
|
|
meaning — for example, to say just "Gandalf" but mean the same wizard.
|
|
</p>
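<p class="commentary">As a rough sketch of the bookkeeping described in this section, the function
below reports which lists a given word of an excerpt would be entered on. It
follows the description above literally; the real code in Excerpt Meanings may
choose lists more selectively. All names here are hypothetical, the set of
articles is assumed to be just "a", "an" and "the", and a one-word excerpt is
assumed to count as a start word only.
</p>

```c
#include <string.h>

/* Hypothetical flags, one per kind of list described above. */
enum { START_LIST = 1, END_LIST = 2, MIDDLE_LIST = 4, SUBSET_LIST = 8 };

/* Assumption: only these three English articles are kept off subset lists. */
static int is_article(const char *w) {
    return strcmp(w, "a") == 0 || strcmp(w, "an") == 0 || strcmp(w, "the") == 0;
}

/* Returns the set of lists on which word i of an n-word excerpt would be
   entered. Assumption: a one-word excerpt counts as a start word only. */
int lists_for_word(const char **words, int n, int i) {
    int lists = 0;
    if (i == 0) lists |= START_LIST;
    else if (i == n - 1) lists |= END_LIST;
    else lists |= MIDDLE_LIST;
    if (!is_article(words[i])) lists |= SUBSET_LIST;
    return lists;
}
```

<p class="commentary">For "great green dragon" this gives start, middle and end respectively, each
combined with the subset flag. For "gandalf the grey", the word "the" receives
the middle flag but not the subset flag.
</p>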
|
|
|
|
<p class="commentary firstcommentary"><a id="SP5" class="paragraph-anchor"></a><b>§5. </b>Secondly, recall that each vocabulary entry has a field 32 bits wide for
|
|
a bitmap, of which only 6 bits were used in the lexer. (See <a href="../words-module/2-vcb.html" class="internal">Vocabulary (in words)</a>.)
|
|
For example, cardinal numbers had the <span class="extract"><span class="extract-syntax">NUMBER_MC</span></span> bit set.
|
|
</p>
|
|
|
|
<p class="commentary">We're now going to use the other 26 bits. The idea is that if a meaning is
|
|
registered for the name of, say, a table, then the <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> bit would be
|
|
set for each of the words in that name. For example, if "table of tides" is
|
|
such a name, then each of the words <span class="extract"><span class="extract-syntax">table</span></span>, <span class="extract"><span class="extract-syntax">of</span></span> and <span class="extract"><span class="extract-syntax">tides</span></span> picks up the
|
|
<span class="extract"><span class="extract-syntax">TABLE_MC</span></span> bit.
|
|
</p>
|
|
|
|
<p class="commentary">What we gain by this is that if we are ever testing some words in the source
|
|
text to see if they might be the name of a table, we can immediately reject,
|
|
say, "green table" because the word <span class="extract"><span class="extract-syntax">green</span></span> does not have the <span class="extract"><span class="extract-syntax">TABLE_MC</span></span> bit.
|
|
</p>
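<p class="commentary">The quick rejection test can be sketched like this. It is an illustration only:
the type <span class="extract"><span class="extract-syntax">sketch_word</span></span>, the two functions and the bit value chosen for
<span class="extract"><span class="extract-syntax">SKETCH_TABLE_MC</span></span> are all assumptions, standing in for the real vocabulary
entries and meaning-code constants.
</p>

```c
#include <stdint.h>

/* Hypothetical bit value: the real TABLE_MC constant is defined elsewhere. */
#define SKETCH_TABLE_MC (UINT32_C(1) << 6)

/* A stand-in for a vocabulary entry: only the 32-bit bitmap matters here. */
typedef struct sketch_word { uint32_t flags; } sketch_word;

/* Registering a name sets the meaning-code bit on every word in it. */
void sketch_register(sketch_word **name, int n, uint32_t mc) {
    for (int i = 0; i < n; i++) name[i]->flags |= mc;
}

/* Quick rejection: an excerpt can only be a name with this meaning code
   if every one of its words carries the bit. Passing this test does not
   prove a match; it only says a full comparison is worth attempting. */
int sketch_might_match(sketch_word **excerpt, int n, uint32_t mc) {
    for (int i = 0; i < n; i++)
        if ((excerpt[i]->flags & mc) == 0) return 0;
    return 1;
}
```

<p class="commentary">After registering "table of tides", the excerpt "green table" fails the test at
its first word. An excerpt such as "tides table" passes it despite not being a
registered name, which is why the test serves only as a filter before the full
comparison.
</p>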
|
|
|
|
<p class="commentary">For more on this, and for complications arising to do with case sensitivity,
|
|
see <a href="2-em.html#SP8_1" class="internal">ExcerptMeanings::hash_code_from_token_list</a>.
|
|
</p>
|
|
|
|
<p class="commentary firstcommentary"><a id="SP6" class="paragraph-anchor"></a><b>§6. Performance in practice. </b>The following statistics show how many times the lexicon was used during
|
|
a typical Inform 7 compilation (the same one used to generate the data in
|
|
<a href="../inform7/M-pm.html" class="internal">Performance Metrics (in inform7)</a>).
|
|
</p>
|
|
|
|
<p class="commentary">Optimisation is worthwhile if:
|
|
</p>
|
|
|
|
<ul class="items"><li>● the number of attempts with incorrect hash codes is appreciably larger
|
|
than the number with correct ones.
|
|
</li></ul>
|
|
<p class="commentary">Optimisation is efficient if:
|
|
</p>
|
|
|
|
<ul class="items"><li>● the number of attempts with correct hash codes is close to the
|
|
number of successes.
|
|
</li></ul>
|
|
<pre class="undisplayed-code all-displayed-code code-font">
|
|
<span class="plain-syntax">Size of lexicon: 3128 excerpt meanings</span>
|
|
<span class="plain-syntax"> Stored among 845 words out of total vocabulary of 10734</span>
|
|
<span class="plain-syntax"> 715 words have a start list: longest belongs to report (with 293 meanings)</span>
|
|
<span class="plain-syntax"> 15 words have an end list: longest belongs to case (with 6 meanings)</span>
|
|
<span class="plain-syntax"> 29 words have a middle list: longest belongs to to (with 4 meanings)</span>
|
|
<span class="plain-syntax"> 108 words have a subset list: longest belongs to street (with 4 meanings)</span>
|
|
|
|
<span class="plain-syntax">Number of attempts to retrieve: 110342</span>
|
|
<span class="plain-syntax"> of which unsuccessful: 92204</span>
|
|
<span class="plain-syntax"> of which successful: 18138</span>
|
|
|
|
<span class="plain-syntax">Total attempts to match against excerpt meanings: 276470</span>
|
|
<span class="plain-syntax"> of which, total with incorrect hash codes: 253734</span>
|
|
<span class="plain-syntax"> of which, total with correct hash codes: 22736</span>
|
|
<span class="plain-syntax"> of which, total which matched: 19905</span>
|
|
</pre>
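<p class="commentary">The two criteria can be checked against these figures with a little arithmetic.
The helper functions below are hypothetical and exist only to make the
calculation explicit.
</p>

```c
/* Ratio of attempts rejected by hash code to attempts that passed it:
   optimisation is worthwhile when this is appreciably greater than 1. */
double worthwhile_ratio(long incorrect_hash, long correct_hash) {
    return (double) incorrect_hash / (double) correct_hash;
}

/* Fraction of hash-passing attempts that went on to match: optimisation
   is efficient when this is close to 1. */
double efficiency(long matched, long correct_hash) {
    return (double) matched / (double) correct_hash;
}
```

<p class="commentary">With the statistics above, <span class="extract"><span class="extract-syntax">worthwhile_ratio(253734, 22736)</span></span> is roughly 11,
so about eleven attempts are rejected cheaply for each one that needs a full
comparison, and <span class="extract"><span class="extract-syntax">efficiency(19905, 22736)</span></span> is roughly 0.88, so most full
comparisons do end in a match.
</p>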
|
|
<nav role="progress"><div class="progresscontainer">
|
|
<ul class="progressbar"><li class="progressprevoff">❮</li><li class="progresscurrentchapter">P</li><li class="progresscurrent">wtmd</li><li class="progresssection"><a href="P-htitm.html">htitm</a></li><li class="progresschapter"><a href="1-lm.html">1</a></li><li class="progresschapter"><a href="2-em.html">2</a></li><li class="progressnext"><a href="P-htitm.html">❯</a></li></ul></div>
|
|
</nav><!--End of weave-->
|
|
|
|
</main>
|
|
</body>
|
|
</html>
|
|
|