mirror of
https://github.com/ganelson/inform.git
synced 2024-07-08 01:54:21 +03:00
167 lines
8.8 KiB
OpenEdge ABL
What This Module Does.

An overview of the words module's role and abilities.

@h Prerequisites.
The words module is a part of the Inform compiler toolset. It is
presented as a literate program or "web". Before diving in:
(a) It helps to have some experience of reading webs: see //inweb// for more.
(b) The module is written in C, in fact ANSI C99, but this is disguised by the
fact that it uses some extension syntaxes provided by the //inweb// literate
programming tool, making it a dialect of C called InC. See //inweb// for
full details, but essentially: it's C without predeclarations or header files,
and where functions have names like |Tags::add_by_name| rather than |add_by_name|.
(c) This module uses other modules drawn from the compiler (see //structure//), and also
uses a module of utility functions called //foundation//.
For more, see //foundation: A Brief Guide to Foundation//.

@h Words, words, words.
Natural language text for use with Inform begins as text files written by
human users, which are fed into the "lexer" (i.e., lexical analyser).
The function //TextFromFiles::feed_open_file_into_lexer// reads such a file,
converting it to a numbered stream of words. For indexing and error reporting
purposes, we must not forget where these words came from: the function returns
a //source_file// object representing the file as an origin, and the lexer
assigns each word a //source_location// which is simply its SF together with
a line number. //Lexer::word_location// returns this for a given word number.

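As a rough illustration of that bookkeeping, here is a minimal sketch in plain C rather than InC. Every name beginning |toy_| is hypothetical and is not part of the //words// API; the sketch also breaks words only at white space, which is much cruder than what the real lexer does.

```c
#include <ctype.h>
#include <stddef.h>

#define TOY_MAX_WORDS 256
#define TOY_MAX_WORD_LENGTH 32

/* Hypothetical, heavily simplified stand-ins for the real structures. */
typedef struct toy_source_file {
    const char *name;
} toy_source_file;

typedef struct toy_source_location {
    toy_source_file *file_of_origin;
    int line_number;
} toy_source_location;

static char toy_text[TOY_MAX_WORDS + 1][TOY_MAX_WORD_LENGTH];
static toy_source_location toy_location[TOY_MAX_WORDS + 1];
static int toy_word_count = 0; /* word numbers run from 1 up to this */

/* Break text at white space, numbering each word and noting its origin.
   Returns the number of words added to the stream. */
int toy_feed(toy_source_file *sf, const char *text) {
    int line = 1, added = 0;
    const char *p = text;
    while (*p) {
        if (*p == '\n') { line++; p++; continue; }
        if (isspace((unsigned char) *p)) { p++; continue; }
        if (toy_word_count >= TOY_MAX_WORDS) break;
        int n = ++toy_word_count;
        size_t len = 0;
        while ((*p) && (!isspace((unsigned char) *p))) {
            /* overlong words are silently truncated in this sketch */
            if (len < TOY_MAX_WORD_LENGTH - 1) toy_text[n][len++] = *p;
            p++;
        }
        toy_text[n][len] = 0;
        toy_location[n].file_of_origin = sf;
        toy_location[n].line_number = line;
        added++;
    }
    return added;
}

/* The origin of word number n, which must lie between 1 and toy_word_count. */
toy_source_location toy_word_location(int n) {
    return toy_location[n];
}
```

The point of the design is that a location costs only a pointer and an integer per word, so it is cheap to keep one for every word in the stream.
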
Word numbers count upwards from 1 and are contiguous: for example --
= (text)
Mary had a  little lamb .  Everywhere that Mary went ,  the lamb
17   18  19 20     21   22 23         24   25   26   27 28  29
=
Repetitions are frequent: a typical source text of 50,000 words has an
unquoted[1] vocabulary of only about 2000 different words. Inform generates
a //vocabulary_entry// object for each of these distinct words, and //Lexer::word//
returns the VE for a given word number. In the above example,
= (text as InC)
Lexer::word(17) == Lexer::word(25) /* both are uses of "Mary" */
Lexer::word(21) == Lexer::word(29) /* both are uses of "lamb" */
Lexer::word(20) != Lexer::word(24) /* one is "little", the other "that" */
=
The important point is that words at two positions can be tested for textual
equality in an essentially instant process, by comparing |vocabulary_entry *|
pointers. (See //Numbered Words// for just this sort of comparison.)

Nothing in life is free, and building the vocabulary efficiently is itself a
challenge: see //Vocabulary::hash_code_from_word//. The key function is
//Vocabulary::entry_for_text//, which takes a wide C string for a word and
returns its //vocabulary_entry//. There are also issues with casing: in
general we want "Lamb" and "lamb" to match, but not always.

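The general technique here is usually called interning. The following is only a sketch of the idea in plain C, under several assumptions: it uses ordinary |char| strings rather than wide ones, it always folds case (whereas the real module does not always want that), and its hash function is a generic one, not the one in //Vocabulary::hash_code_from_word//. All |toy_| names are hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define TOY_HASH_SIZE 997

typedef struct toy_vocabulary_entry {
    char *text; /* stored in lower case */
    struct toy_vocabulary_entry *next_in_bucket;
} toy_vocabulary_entry;

static toy_vocabulary_entry *toy_buckets[TOY_HASH_SIZE];

/* A case-insensitive hash, so that "Lamb" and "lamb" share a bucket. */
static unsigned int toy_hash(const char *text) {
    unsigned int h = 0;
    for (; *text; text++)
        h = 31*h + (unsigned int) tolower((unsigned char) *text);
    return h % TOY_HASH_SIZE;
}

/* Does the stored (lower-case) text match this text, ignoring case? */
static int toy_same_word(const char *stored, const char *text) {
    for (; (*stored) && (*text); stored++, text++)
        if ((unsigned char) *stored != tolower((unsigned char) *text)) return 0;
    return (*stored == 0) && (*text == 0);
}

/* Return the unique entry for this text, creating it on first sight.
   Returns NULL only if memory runs out. */
toy_vocabulary_entry *toy_entry_for_text(const char *text) {
    unsigned int h = toy_hash(text);
    for (toy_vocabulary_entry *ve = toy_buckets[h]; ve; ve = ve->next_in_bucket)
        if (toy_same_word(ve->text, text)) return ve;
    toy_vocabulary_entry *ve = malloc(sizeof(toy_vocabulary_entry));
    if (ve == NULL) return NULL;
    size_t len = strlen(text);
    ve->text = malloc(len + 1);
    if (ve->text == NULL) { free(ve); return NULL; }
    for (size_t i = 0; i <= len; i++) /* copies the terminator too */
        ve->text[i] = (char) tolower((unsigned char) text[i]);
    ve->next_in_bucket = toy_buckets[h];
    toy_buckets[h] = ve;
    return ve;
}
```

The cost of hashing and string comparison is paid once, when a word enters the stream; after that, two words are the same text exactly when their entry pointers are equal.
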
[1] A piece of text in double-quotes is treated as a single word by the lexer,
although //inform7// may later unroll text substitutions in it, calling the
lexer again to do that.

@ A few //vocabulary_entry// objects are hardwired into //words//, but only
for punctuation. These have names like |COMMA_V|, which means just what you
think it means. In our example,
= (text as InC)
Lexer::word(27) == COMMA_V /* the comma between "went" and "the" */
=
See //Vocabulary::create_punctuation//, and also //LoadPreform::create_punctuation//,
where further punctuation marks are created in order to parse Preform syntax --
there are exotica such as |COLONCOLONEQUALS_V| there, for "::=".

@ Lexical errors occur if words are too long, or quoted text continues without
a close quote right to the end of a file, and so on. These are sent to the
function //Lexer::lexer_problem_handler//, but can be intercepted by the
user (see //How To Include This Module//).

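One common way to make an error handler interceptable is to route every error through a function pointer which the user may replace. The sketch below shows that pattern in plain C; it is an assumption about the general shape, not a description of how //Lexer::lexer_problem_handler// is actually wired up. The |toy_| names, the error codes and the word length limit are all hypothetical.

```c
#include <stddef.h>

#define TOY_MAX_WORD_LENGTH 16 /* an arbitrary limit for this sketch */

enum { TOY_WORD_TOO_LONG = 1, TOY_STRING_NEVER_ENDS = 2 };

typedef void (*toy_problem_handler)(int error_code, int line_number);

/* The default handler simply counts the problems it is sent. */
static int toy_problems_issued = 0;
static void toy_default_handler(int error_code, int line_number) {
    (void) error_code; (void) line_number;
    toy_problems_issued++;
}

/* An alternative handler which a user might install instead. */
int toy_last_error = 0;
void toy_recording_handler(int error_code, int line_number) {
    (void) line_number;
    toy_last_error = error_code;
}

static toy_problem_handler toy_current_handler = toy_default_handler;

/* Install a handler; passing NULL restores the default. */
void toy_set_problem_handler(toy_problem_handler h) {
    toy_current_handler = (h) ? h : toy_default_handler;
}

/* Scan text for overlong words and unterminated quoted text, reporting each
   to whichever handler is installed. Returns the number of errors found. */
int toy_check_text(const char *text) {
    int errors = 0, line = 1, in_quotes = 0, quote_line = 0;
    size_t run = 0; /* length of the current unquoted word */
    for (const char *p = text; *p; p++) {
        if (*p == '\n') line++;
        if (*p == '"') {
            in_quotes = !in_quotes; quote_line = line; run = 0;
        } else if (in_quotes == 0) {
            if ((*p == ' ') || (*p == '\n') || (*p == '\t')) run = 0;
            else if (++run == TOY_MAX_WORD_LENGTH + 1) {
                (*toy_current_handler)(TOY_WORD_TOO_LONG, line);
                errors++;
            }
        }
    }
    if (in_quotes) {
        (*toy_current_handler)(TOY_STRING_NEVER_ENDS, quote_line);
        errors++;
    }
    return errors;
}
```

A tool embedding such a lexer could call |toy_set_problem_handler| once at start-up and thereafter receive lexical errors in its own format.
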
@h Meaning codes.
Each //vocabulary_entry// has a bitmap of |*_MC| meaning codes assigned to it.
(And //Vocabulary::test_flags// tests whether the Nth word has a given bit.)
For example, |ORDINAL_MC| is applied to ordinal numbers like "sixth" or "15th"
(see //Vocabulary::an_ordinal_number//), and |NUMBER_MC| to cardinals. The
//words// module uses only a few bits in this map, but the //linguistics//
module develops the idea much further: for example, any word which can be used
in a particular semantic category -- say, in a variable name -- is marked
with a bit representing that -- say, |VARIABLE_MC|. The //core// module
uses this for 15 or so of the most commonly used semantic categories in the
Inform language. See //linguistics: What This Module Does// to pick up the story.

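A bitmap of this kind is ordinary C bit-twiddling. Here is a minimal sketch, assuming made-up bit values (the real |*_MC| constants are defined in the webs themselves) and a test function which takes an entry rather than a word number. It recognises only numbers written in digits, so "sixth" would go unmarked. All |toy_| and |TOY_| names are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define TOY_NUMBER_MC  0x00000001u
#define TOY_ORDINAL_MC 0x00000002u

typedef struct toy_flagged_entry {
    const char *text;
    unsigned int flags; /* bitmap of meaning codes */
} toy_flagged_entry;

void toy_set_flags(toy_flagged_entry *ve, unsigned int mc) {
    ve->flags |= mc;
}

int toy_test_flags(const toy_flagged_entry *ve, unsigned int mc) {
    return (ve->flags & mc) != 0;
}

/* Mark "15" as a cardinal and "15th" as an ordinal; leave other text alone. */
void toy_classify(toy_flagged_entry *ve) {
    const char *p = ve->text;
    if (!isdigit((unsigned char) *p)) return;
    while (isdigit((unsigned char) *p)) p++;
    if (*p == 0) toy_set_flags(ve, TOY_NUMBER_MC);
    else if ((strcmp(p, "st") == 0) || (strcmp(p, "nd") == 0) ||
        (strcmp(p, "rd") == 0) || (strcmp(p, "th") == 0))
        toy_set_flags(ve, TOY_ORDINAL_MC);
}
```

Because the flags live on the entry and not on each occurrence, classifying a word once marks every use of it in the source text.
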
@h Contiguous runs of words.
Natural languages are fundamentally unlike programming languages because a noun
referring to, say, a variable is rarely a single lexical token. In C, a variable
name like |selected_lamb| is one lexical unit. For us, though, "a little lamb"
is three words.

However, multi-word snippets of text which have a joint meaning are almost
always contiguous. The text "a little lamb" is word numbers 19, 20, 21. We
deal with this using the //wording// type: it's essentially a pair of integers,
|(19, 21)|, and thus is very quick to form, compare, copy and pass as a
parameter. //Wordings// provides an extensive API for this.

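To see why such a type is cheap, consider this sketch of a pair-of-integers structure in plain C. The field and function names are hypothetical, and the handful of operations shown is only a guess at the flavour of what //Wordings// offers, not a copy of it.

```c
/* A run of consecutive words, held as its first and last word numbers. */
typedef struct toy_wording {
    int first, last; /* inclusive; last < first means an empty wording */
} toy_wording;

toy_wording toy_wording_new(int first, int last) {
    toy_wording W = { first, last };
    return W;
}

int toy_wording_length(toy_wording W) {
    return (W.last >= W.first) ? (W.last - W.first + 1) : 0;
}

/* Are these the same stretch of the word stream? This compares positions,
   not text: two different occurrences of "a little lamb" are unequal. */
int toy_wording_eq(toy_wording W1, toy_wording W2) {
    return (W1.first == W2.first) && (W1.last == W2.last);
}

/* The same wording with its first word removed. */
toy_wording toy_wording_trim_first(toy_wording W) {
    W.first++;
    return W;
}

/* Does the outer wording include every word of the inner one? */
int toy_wording_contains(toy_wording outer, toy_wording inner) {
    return (outer.first <= inner.first) && (inner.last <= outer.last);
}
```

Everything here is passed and returned by value: there is no allocation, and nothing to free.
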
@h Hypothetical words.
Sometimes Inform needs to make hypothetical passages of text. For example,
suppose there is a kind called "paint colour" in the source text; Inform may
then want to create a variable called "paint colour understood". But this text
may not occur as such anywhere in the source.

If all the words needed are in the source somewhere, but not together, the user
of the //words// module has two options:

(*) Create a //word_assemblage// object. This can represent any discontiguous
list of word numbers: thus, the text "lamb went everywhere" could be a WA
of numbers (21, 26, 23) in our example above.
(*) Use //Lexer::splice_words// to create duplicate snippets of text in the
word stream, with new numbers. For example, call this on "lamb", then "went",
then "everywhere"; the three new word numbers will then be contiguous, and
can be represented by a //wording//:
= (text)
Mary had a  little lamb .  Everywhere that Mary went ,  the lamb lamb went everywhere
17   18  19 20     21   22 23         24   25   26   27 28  29   30   31   32
=

If however we want to make "lamb tian with haricot beans", we need to use the
Lexer's ability to read text internally as well as from external files. This
is called a "feed": see //Feeds//. In particular, //Feeds::feed_text// will
take the text |I"tian with haricot beans"|, treat this as fresh text for
lexing so that we now have
= (text)
... ,  the lamb lamb went everywhere tian with haricot beans
... 27 28  29   30   31   32         33   34   35      36
=
and now the word assemblage (21, 33, 34, 35, 36) would indeed represent "lamb
tian with haricot beans". The return value of //Feeds::feed_text// is the
//wording// (33, 36).

These new words do not originate in a file; their //source_location// therefore
has a null //source_file//. Words which have been spliced, however, and thus
duplicated in the word stream (like "lamb went everywhere", 30-32), retain
their original origins.

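The difference between splicing and feeding can be modelled in a few lines of plain C. This is a toy word stream under stated assumptions: words are broken only at spaces, an origin is just a file name (or |NULL| for internally fed text), and the splice function duplicates one word at a time. None of the |toy_| names belongs to the real //Lexer// or //Feeds// API.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_WORDS 64
#define TOY_MAX_TEXT 1024

typedef struct toy_wording { int first, last; } toy_wording;

static const char *toy_word[TOY_MAX_WORDS + 1];   /* word numbers start at 1 */
static const char *toy_origin[TOY_MAX_WORDS + 1]; /* file name, or NULL if fed */
static int toy_count = 0;
static char toy_storage[TOY_MAX_TEXT];
static size_t toy_used = 0;

/* Lex text at spaces, appending its words to the stream with new numbers.
   Returns the wording of the words added (empty if there were none). */
toy_wording toy_feed_text(const char *text, const char *origin) {
    toy_wording W = { toy_count + 1, toy_count };
    while (*text) {
        while (*text == ' ') text++;
        size_t len = strcspn(text, " ");
        if (len == 0) break;
        if ((toy_count >= TOY_MAX_WORDS) || (toy_used + len + 1 > TOY_MAX_TEXT))
            break;
        char *copy = toy_storage + toy_used;
        memcpy(copy, text, len); copy[len] = 0;
        toy_used += len + 1;
        toy_count++;
        toy_word[toy_count] = copy;
        toy_origin[toy_count] = origin;
        text += len;
    }
    W.last = toy_count;
    return W;
}

/* Duplicate existing word n at the end of the stream, keeping its origin.
   Returns the new word number, or 0 if that was impossible. */
int toy_splice_word(int n) {
    if ((n < 1) || (n > toy_count) || (toy_count >= TOY_MAX_WORDS)) return 0;
    toy_count++;
    toy_word[toy_count] = toy_word[n];
    toy_origin[toy_count] = toy_origin[n];
    return toy_count;
}
```

In this model a spliced word shares both its text and its origin with the word it duplicates, whereas a fed word has an origin only if the caller supplies one.
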
@h Rock, paper, scissors.
We now have three ways to represent text which may contain multiple words:
as a |text_stream|, as a |wording|, as a |word_assemblage|. Each can be
converted into the other two:

(*) Use //Feeds::feed_text// to turn a |text_stream| to a |wording|.
(*) Use //WordAssemblages::from_wording// to turn a |wording| to a |word_assemblage|.
(*) Use //WordAssemblages::to_wording// to turn a |word_assemblage| to a |wording|.
(*) Use //Wordings::writer// or use the formatted |WRITE| escape |%W| to
write a |wording| into a |text_stream|.
(*) Use //WordAssemblages::writer// or use the formatted |WRITE| escape |%A| to
write a |word_assemblage| into a |text_stream|.

As a general design goal, all Inform code uses //wording// to identify names
of things: this is fastest and most efficient on memory.

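Two of these conversions are simple enough to model directly. The sketch below, in plain C, turns a wording into an assemblage and writes either of them out as text; it assumes a small fixed word stream and a fixed-capacity assemblage, neither of which reflects the real data structures, and it writes into a |char| buffer rather than a |text_stream|. All |toy_| names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_WA 8

typedef struct toy_wording { int first, last; } toy_wording;

typedef struct toy_word_assemblage {
    int no_words;
    int numbers[TOY_MAX_WA]; /* word numbers, in any order */
} toy_word_assemblage;

/* A fixed stand-in for the lexer's word stream; entry 0 is unused. */
static const char *toy_stream[] = { NULL, "Mary", "had", "a", "little", "lamb" };
#define TOY_STREAM_LENGTH 5

/* List the word numbers of a wording explicitly, up to capacity. */
toy_word_assemblage toy_wa_from_wording(toy_wording W) {
    toy_word_assemblage wa = { 0, { 0 } };
    for (int n = W.first; (n <= W.last) && (wa.no_words < TOY_MAX_WA); n++)
        wa.numbers[wa.no_words++] = n;
    return wa;
}

/* Append word n to the buffer, space-separated; does nothing if the word
   number is invalid or the word would not fit. */
static void toy_append(char *buffer, size_t size, int n) {
    if ((n < 1) || (n > TOY_STREAM_LENGTH)) return;
    size_t used = strlen(buffer), len = strlen(toy_stream[n]);
    size_t gap = (used > 0) ? 1 : 0;
    if (used + gap + len + 1 > size) return;
    if (gap) buffer[used++] = ' ';
    memcpy(buffer + used, toy_stream[n], len + 1);
}

void toy_write_wording(char *buffer, size_t size, toy_wording W) {
    if (size == 0) return;
    buffer[0] = 0;
    for (int n = W.first; n <= W.last; n++) toy_append(buffer, size, n);
}

void toy_write_assemblage(char *buffer, size_t size, toy_word_assemblage wa) {
    if (size == 0) return;
    buffer[0] = 0;
    for (int i = 0; i < wa.no_words; i++)
        toy_append(buffer, size, wa.numbers[i]);
}
```

The sketch also makes the memory argument visible: the wording is two integers however long the text, while the assemblage grows with the number of words.
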
@h Traditional identifiers.
Imagine you're a compiler turning natural language into some sort of computer
code, just hypothetically: then you probably want "a little lamb" to come out
as a named location in memory, or object, or something like that: and this name
must be a valid identifier for some other compiler or assembler -- alphanumeric,
not too long, and so on. Calling it "a little lamb" is not an option.

You could of course name it |ref_15A40F|, or some such, because the user will
never see it anyway, so why have a helpful name? But that won't make debugging
your output easy. The function //Identifiers::compose// therefore takes a
wording and a unique ID number and makes something sensible: |I15_a_little_lamb|,
say.

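A composition of that kind might be sketched as follows in plain C. This is a guess at one reasonable set of rules (lower-case the letters, keep digits, turn any other run of characters into a single underscore, truncate at an arbitrary length) and makes no claim to match what //Identifiers::compose// really does. It also takes the words as a plain string, not a wording. The |toy_| names and the length cap are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_IDENTIFIER 32 /* arbitrary cap, counting the terminator */

/* Compose "I<id>_<words>" into the buffer. Letters are lower-cased, digits
   are kept, and each run of other characters becomes one underscore. The
   result never exceeds the smaller of size and TOY_MAX_IDENTIFIER. */
void toy_compose_identifier(char *buffer, size_t size, int id, const char *words) {
    if (size == 0) return;
    size_t limit = (size < TOY_MAX_IDENTIFIER) ? size : TOY_MAX_IDENTIFIER;
    int written = snprintf(buffer, limit, "I%d", id);
    if (written < 0) { buffer[0] = 0; return; }
    size_t used = strlen(buffer);
    int pending_gap = 1; /* so that an underscore follows the ID */
    for (const char *p = words; *p; p++) {
        if (isalnum((unsigned char) *p)) {
            if (pending_gap) {
                /* need room for '_', one character and the terminator */
                if (used + 2 >= limit) break;
                buffer[used++] = '_';
                pending_gap = 0;
            }
            if (used + 1 >= limit) break;
            buffer[used++] = (char) tolower((unsigned char) *p);
        } else {
            pending_gap = 1;
        }
    }
    buffer[used] = 0;
}
```

Since the ID number is unique, the identifier stays unique even when truncation or character replacement makes two names otherwise collide.
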
@h Preform.
Preform is a meta-language for writing a simple grammar: it's in some sense
pre-Inform, because it defines the Inform language itself. See //About Preform//,
where the story told in the present section continues...