<!--Weave of 'What This Module Does' generated by Inweb-->
<p class="purpose">An overview of the words module's role and abilities.</p>
<ul class="toc"><li><a href="P-wtmd.html#SP1">§1. Prerequisites</a></li><li><a href="P-wtmd.html#SP2">§2. Words, words, words</a></li><li><a href="P-wtmd.html#SP5">§5. Meaning codes</a></li><li><a href="P-wtmd.html#SP6">§6. Contiguous runs of words</a></li><li><a href="P-wtmd.html#SP7">§7. Hypothetical words</a></li><li><a href="P-wtmd.html#SP8">§8. Rock, paper, scissors</a></li><li><a href="P-wtmd.html#SP9">§9. Traditional identifiers</a></li></ul><hr class="tocbar">
<p class="commentary firstcommentary"><a id="SP1"></a><b>§1. Prerequisites. </b>The words module is a part of the Inform compiler toolset. It is
presented as a literate program or "web". Before diving in:
</p>
<ul class="items"><li>(a) It helps to have some experience of reading webs: see <a href="../../../inweb/docs/index.html" class="internal">inweb</a> for more.
</li><li>(b) The module is written in C, in fact ANSI C99, but this is disguised by the
fact that it uses some extension syntaxes provided by the <a href="../../../inweb/docs/index.html" class="internal">inweb</a> literate
programming tool, making it a dialect of C called InC. See <a href="../../../inweb/docs/index.html" class="internal">inweb</a> for
full details, but essentially: it's C without predeclarations or header files,
and where functions have names like <span class="extract"><span class="extract-syntax">Tags::add_by_name</span></span> rather than <span class="extract"><span class="extract-syntax">add_by_name</span></span>.
</li><li>(c) This module uses other modules drawn from the <a href="../compiler.html" class="internal">compiler</a>, and also
uses a module of utility functions called <a href="../../../inweb/docs/foundation-module/index.html" class="internal">foundation</a>.
For more, see <a href="../../../inweb/docs/foundation-module/P-abgtf.html" class="internal">A Brief Guide to Foundation (in foundation)</a>.
</li></ul>
<p class="commentary firstcommentary"><a id="SP2"></a><b>§2. Words, words, words. </b>Natural language text for use with Inform begins as text files written by
human users, which are fed into the "lexer" (i.e., lexical analyser).
The function <a href="3-tff.html#SP2" class="internal">TextFromFiles::feed_open_file_into_lexer</a> reads such a file,
converting it to a numbered stream of words. For indexing and error reporting
purposes, we must not forget where these words came from: the function returns
a <a href="3-tff.html#SP1" class="internal">source_file</a> object representing the file as an origin, and the lexer
assigns each word a <a href="3-lxr.html#SP3" class="internal">source_location</a> which is simply its SF together with
a line number. <a href="3-lxr.html#SP18" class="internal">Lexer::word_location</a> returns this for a given word number.
</p>
<p class="commentary">Word numbers count upwards from 1 and are contiguous. The examples below
refer to a passage in which "Mary" is word 17 and "a little lamb" is words 19 to 21.
</p>
<p class="commentary">Repetitions are frequent: a typical source text of 50,000 words has an
unquoted<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> vocabulary of only about 2000 different words. Inform generates
a <a href="2-vcb.html#SP1" class="internal">vocabulary_entry</a> object for each of these distinct words, and <a href="3-lxr.html#SP18" class="internal">Lexer::word</a>
returns the VE for a given word number. In the above example,
</p>
<pre class="displayed-code all-displayed-code code-font">
<span class="plain-syntax"></span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(17) == </span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(25) </span><span class="comment-syntax"> both uses of "Mary"</span>
<span class="plain-syntax"></span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(21) == </span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(29) </span><span class="comment-syntax"> both uses of "lamb"</span>
<span class="plain-syntax"></span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(20) != </span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(24) </span><span class="comment-syntax"> one is "little", the other "that"</span>
</pre>
<p class="commentary">The important point is that words at two positions can be tested for textual
equality in an essentially instant process, by comparing <span class="extract"><span class="extract-syntax">vocabulary_entry *</span></span>
pointers.
</p>
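<p class="commentary">A minimal sketch of why pointer comparison suffices, assuming a simplified
design in which each distinct word is stored exactly once in a hash table. The names
<span class="extract"><span class="extract-syntax">vocab_entry</span></span>, <span class="extract"><span class="extract-syntax">hash_text</span></span> and <span class="extract"><span class="extract-syntax">intern_word</span></span> are hypothetical
stand-ins, using ordinary <span class="extract"><span class="extract-syntax">char</span></span> strings and a generic hash; they are not the
module's own definitions.
</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for vocabulary_entry: one object per distinct word. */
typedef struct vocab_entry {
    char *text;               /* private copy of the word's text */
    struct vocab_entry *next; /* next entry in the same hash bucket */
} vocab_entry;

#define NUM_BUCKETS 256
static vocab_entry *buckets[NUM_BUCKETS];

/* A generic string hash (djb2); an assumption, not the module's hash. */
static unsigned hash_text(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h % NUM_BUCKETS;
}

/* Return the unique entry for this text, creating it on first sight. */
vocab_entry *intern_word(const char *text) {
    unsigned h = hash_text(text);
    for (vocab_entry *ve = buckets[h]; ve; ve = ve->next)
        if (strcmp(ve->text, text) == 0) return ve;

    /* Not seen before: allocate a new entry (failure is fatal in this sketch). */
    vocab_entry *ve = malloc(sizeof *ve);
    assert(ve != NULL);
    ve->text = malloc(strlen(text) + 1);
    assert(ve->text != NULL);
    strcpy(ve->text, text);
    ve->next = buckets[h];
    buckets[h] = ve;
    return ve;
}
```

<p class="commentary">Under these assumptions, interning "Mary" twice yields the same pointer, so
testing two words for equality costs one pointer comparison rather than a string comparison.
</p>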
<p class="commentary">Nothing in life is free, and building the vocabulary efficiently is itself a
challenge: see <a href="2-vcb.html#SP13" class="internal">Vocabulary::hash_code_from_word</a>. The key function is
<a href="2-vcb.html#SP15" class="internal">Vocabulary::entry_for_text</a>, which takes a wide C string for a word and
returns its <a href="2-vcb.html#SP1" class="internal">vocabulary_entry</a>. There are also issues with casing: in
general we want "Lamb" and "lamb" to match, but not always.
</p>
<ul class="footnotetexts"><li class="footnote" id="fn:1"><p class="inwebfootnote"><sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> A piece of text in double-quotes is treated as a single word by the lexer,
although <a href="../inform7/index.html" class="internal">inform7</a> may later unroll text substitutions in it, calling the
lexer again to do that.
<a href="#fnref:1" title="return to text">↩</a></p></li></ul>
<p class="commentary firstcommentary"><a id="SP3"></a><b>§3. </b>A few <a href="2-vcb.html#SP1" class="internal">vocabulary_entry</a> objects are hardwired into <a href="index.html" class="internal">words</a>, but only
for punctuation. These have names like <span class="extract"><span class="extract-syntax">COMMA_V</span></span>, which means just what you
would expect:
</p>
<pre class="displayed-code all-displayed-code code-font">
<span class="plain-syntax"></span><span class="function-syntax">Lexer::word</span><span class="plain-syntax">(27) == </span><span class="identifier-syntax">COMMA_V</span><span class="plain-syntax"></span><span class="comment-syntax"> the comma between "went" and "the"</span>
</pre>
<p class="commentary firstcommentary"><a id="SP4"></a><b>§4. </b>Lexical errors occur if words are too long, or quoted text continues without
a close quote right to the end of a file, and so on. These are sent to the
function <a href="3-lxr.html#SP29" class="internal">Lexer::lexer_problem_handler</a>, but can be intercepted by the
user (see <a href="P-htitm.html" class="internal">How To Include This Module</a>).
</p>
<p class="commentary firstcommentary"><a id="SP5"></a><b>§5. Meaning codes. </b>Each <a href="2-vcb.html#SP1" class="internal">vocabulary_entry</a> has a bitmap of <span class="extract"><span class="extract-syntax">*_MC</span></span> meaning codes assigned to it.
(And <a href="2-vcb.html#SP10" class="internal">Vocabulary::test_flags</a> tests whether the Nth word has a given bit.)
For example, <span class="extract"><span class="extract-syntax">ORDINAL_MC</span></span> is applied to ordinal numbers like "sixth" or "15th"
(see <a href="2-vcb.html#SP17" class="internal">Vocabulary::an_ordinal_number</a>), and <span class="extract"><span class="extract-syntax">NUMBER_MC</span></span> to cardinals. The
<a href="index.html" class="internal">words</a> module uses only a few bits in this map, but the <a href="../linguistics-module/index.html" class="internal">linguistics</a>
module develops the idea much further: for example, any word which can be used
in a particular semantic category (say, in a variable name) is marked
with a bit representing that (say, <span class="extract"><span class="extract-syntax">VARIABLE_MC</span></span>). The <a href="../core-module/index.html" class="internal">core</a> module
uses this for 15 or so of the most commonly used semantic categories in the
Inform language. See <a href="../linguistics-module/P-wtmd.html" class="internal">What This Module Does (in linguistics)</a> to pick up the story.
</p>
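<p class="commentary">The bitmap idea can be sketched as follows. This is only an illustration:
the bit values, the <span class="extract"><span class="extract-syntax">_SKETCH</span></span> constant names, the <span class="extract"><span class="extract-syntax">flagged_word</span></span> type and both
functions are assumptions made for the example, and unlike <a href="2-vcb.html#SP10" class="internal">Vocabulary::test_flags</a>
the test here takes an entry directly rather than a word number.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; the real *_MC constants are defined by the modules. */
#define NUMBER_MC_SKETCH   0x00000001u
#define ORDINAL_MC_SKETCH  0x00000002u
#define VARIABLE_MC_SKETCH 0x00000004u

/* Hypothetical stand-in for a vocabulary entry carrying a meaning-code bitmap. */
typedef struct flagged_word {
    const char *text;
    unsigned int flags; /* one bit per meaning code */
} flagged_word;

/* Mark the word with every meaning code present in mc. */
void set_flags(flagged_word *fw, unsigned int mc) {
    assert(fw != NULL);
    fw->flags |= mc;
}

/* Return nonzero if the word has at least one of the meaning codes in mc. */
int test_flags(const flagged_word *fw, unsigned int mc) {
    assert(fw != NULL);
    return (fw->flags & mc) != 0;
}
```

<p class="commentary">Because each code is a single bit, a word can belong to several categories at
once, and membership is tested with one bitwise AND.
</p>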
<p class="commentary firstcommentary"><a id="SP6"></a><b>§6. Contiguous runs of words. </b>Natural languages are fundamentally unlike programming languages because a noun
referring to, say, a variable is rarely a single lexical token. In C, a variable
name like <span class="extract"><span class="extract-syntax">selected_lamb</span></span> is one lexical unit. For us, though, "a little lamb"
is three words.
</p>
<p class="commentary">However, multi-word snippets of text which have a joint meaning are almost
always contiguous. The text "a little lamb" is word numbers 19, 20, 21. We
deal with this using the <a href="3-wrd.html#SP2" class="internal">wording</a> type: it's essentially a pair of integers,
<span class="extract"><span class="extract-syntax">(19, 21)</span></span>, and thus is very quick to form, compare, copy and pass as a
parameter. <a href="3-wrd.html" class="internal">Wordings</a> provides an extensive API for this.
</p>
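<p class="commentary">A rough sketch of such a pair-of-integers type, to show why it is cheap to
handle. The type <span class="extract"><span class="extract-syntax">word_range</span></span> and its three functions are hypothetical names
invented for this example; the real <a href="3-wrd.html#SP2" class="internal">wording</a> type and its API are defined in
<a href="3-wrd.html" class="internal">Wordings</a>.
</p>

```c
#include <assert.h>

/* Hypothetical stand-in for wording: an inclusive range of word numbers. */
typedef struct word_range {
    int first; /* number of the first word */
    int last;  /* number of the last word */
} word_range;

/* Form a range; small enough to pass and return by value. */
word_range range_new(int first, int last) {
    assert(first <= last);
    word_range w = { first, last };
    return w;
}

/* Number of words covered by the range. */
int range_length(word_range w) {
    return w.last - w.first + 1;
}

/* Positional equality: the same words at the same place in the stream. */
int range_eq(word_range a, word_range b) {
    return a.first == b.first && a.last == b.last;
}
```

<p class="commentary">With this sketch, <span class="extract"><span class="extract-syntax">range_new(19, 21)</span></span> stands for "a little lamb" and has
length 3; copying it is an ordinary struct assignment.
</p>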
<p class="commentary firstcommentary"><a id="SP7"></a><b>§7. Hypothetical words. </b>Sometimes Inform needs to make hypothetical passages of text. For example,
suppose there is a kind called "paint colour" in the source text; Inform may
then want to create a variable called "paint colour understood". But this text
may not occur as such anywhere in the source.
</p>
<p class="commentary">If all the words needed are in the source somewhere, but not together, the user
of the <a href="index.html" class="internal">words</a> module has two options:
</p>
<ul class="items"><li>● Create a <a href="2-wa.html#SP2" class="internal">word_assemblage</a> object. This can represent any discontiguous
list of word numbers: thus, the text "lamb went everywhere" could be a WA
of numbers (21, 26, 23) in our example above.
</li><li>● Use <a href="3-lxr.html#SP28" class="internal">Lexer::splice_words</a> to create duplicate snippets of text in the
word stream, with new numbers. For example, call this on "lamb", then "went",
then "everywhere"; the three new word numbers will then be contiguous, and
can be represented by a <a href="3-wrd.html#SP2" class="internal">wording</a>, here <span class="extract"><span class="extract-syntax">(30, 32)</span></span>.
</li></ul>
<p class="commentary">If however we want to make "lamb tian with haricot beans", we need to use the
Lexer's ability to read text internally as well as from external files. This
is called a "feed": see <a href="3-fds.html" class="internal">Feeds</a>. In particular, <a href="3-fds.html#SP4" class="internal">Feeds::feed_text</a> will
take the text <span class="extract"><span class="extract-syntax">I"tian with haricot beans"</span></span>, treat this as fresh text for
the lexer, and return a <a href="3-wrd.html#SP2" class="internal">wording</a> for the new words.
</p>
<p class="commentary">These new words do not originate in a file; their <a href="3-lxr.html#SP3" class="internal">source_location</a> therefore
has a null <a href="3-tff.html#SP1" class="internal">source_file</a>. Words which have been spliced, however, and thus
duplicated in the word stream (like "lamb went everywhere", 30-32), retain
their original origins.
</p>
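<p class="commentary">The splicing option described earlier in this section can be pictured with
a toy word stream. Everything here is an assumption made for illustration: the array
<span class="extract"><span class="extract-syntax">word_stream</span></span>, the functions <span class="extract"><span class="extract-syntax">add_word</span></span> and <span class="extract"><span class="extract-syntax">splice_range</span></span>, and their signatures
are hypothetical and are not those of <a href="3-lxr.html#SP28" class="internal">Lexer::splice_words</a>.
</p>

```c
#include <assert.h>

#define MAX_WORDS 64

/* A toy word stream: word numbers count from 1, so slot 0 is unused. */
static const char *word_stream[MAX_WORDS + 1];
static int word_count = 0;

/* Append one word to the stream and return its word number. */
int add_word(const char *text) {
    assert(word_count < MAX_WORDS);
    word_stream[++word_count] = text;
    return word_count;
}

/* Duplicate words first..last at the end of the stream, and return the
   word number of the first copy. The copies share the original text. */
int splice_range(int first, int last) {
    assert(1 <= first && first <= last && last <= word_count);
    int start = word_count + 1;
    for (int i = first; i <= last; i++) add_word(word_stream[i]);
    return start;
}
```

<p class="commentary">Splicing three scattered words one after another gives three consecutive new
word numbers, which is what lets a simple pair of integers describe the result.
</p>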
<p class="commentary firstcommentary"><a id="SP8"></a><b>§8. Rock, paper, scissors. </b>We now have three ways to represent text which may contain multiple words:
as a <span class="extract"><span class="extract-syntax">text_stream</span></span>, as a <span class="extract"><span class="extract-syntax">wording</span></span>, as a <span class="extract"><span class="extract-syntax">word_assemblage</span></span>. Each can be
converted into the other two:
</p>
<ul class="items"><li>● Use <a href="3-fds.html#SP4" class="internal">Feeds::feed_text</a> to turn a <span class="extract"><span class="extract-syntax">text_stream</span></span> to a <span class="extract"><span class="extract-syntax">wording</span></span>.
</li><li>● Use <a href="2-wa.html#SP4" class="internal">WordAssemblages::from_wording</a> to turn a <span class="extract"><span class="extract-syntax">wording</span></span> to a <span class="extract"><span class="extract-syntax">word_assemblage</span></span>.
</li><li>● Use <a href="2-wa.html#SP7" class="internal">WordAssemblages::to_wording</a> to turn a <span class="extract"><span class="extract-syntax">word_assemblage</span></span> to a <span class="extract"><span class="extract-syntax">wording</span></span>.
</li><li>● Use <a href="3-wrd.html#SP22" class="internal">Wordings::writer</a> or use the formatted <span class="extract"><span class="extract-syntax">WRITE</span></span> escape <span class="extract"><span class="extract-syntax">%W</span></span> to
write a <span class="extract"><span class="extract-syntax">wording</span></span> into a <span class="extract"><span class="extract-syntax">text_stream</span></span>.
</li><li>● Use <a href="2-wa.html#SP9" class="internal">WordAssemblages::writer</a> or use the formatted <span class="extract"><span class="extract-syntax">WRITE</span></span> escape <span class="extract"><span class="extract-syntax">%A</span></span> to
write a <span class="extract"><span class="extract-syntax">word_assemblage</span></span> into a <span class="extract"><span class="extract-syntax">text_stream</span></span>.
</li></ul>
<p class="commentary">As a general design goal, all Inform code uses <a href="3-wrd.html#SP2" class="internal">wording</a> to identify names
of things: this is fastest and most efficient on memory.
</p>
<p class="commentary firstcommentary"><a id="SP9"></a><b>§9. Traditional identifiers. </b>Imagine you're a compiler turning natural language into some sort of computer
code, just hypothetically: then you probably want "a little lamb" to come out
as a named location in memory, or object, or something like that: and this name
must be a valid identifier for some other compiler or assembler — alphanumeric,
not too long, and so on. Calling it "a little lamb" is not an option.
</p>
<p class="commentary">You could of course name it <span class="extract"><span class="extract-syntax">ref_15A40F</span></span>, or some such, because the user will
never see it anyway, so why have a helpful name? But that won't make debugging
your output easy. The function <a href="3-idn.html#SP3" class="internal">Identifiers::compose</a> therefore takes a
wording and a unique ID number and makes something sensible: <span class="extract"><span class="extract-syntax">I15_a_little_lamb</span></span>,