<pclass="purpose">To break down a stream of characters into a numbered sequence of words, literal strings and literal I6 inclusions, removing comments and unnecessary whitespace.</p>
<ulclass="toc"><li><ahref="#SP1">§1. Definitions</a></li><li><ahref="#SP5">§5. The lexical structure of source text</a></li><li><ahref="#SP9">§9. What the lexer stores for each word</a></li><li><ahref="#SP15">§15. External lexer states</a></li><li><ahref="#SP16">§16. Definition of punctuation</a></li><li><ahref="#SP17">§17. Definition of indentation</a></li><li><ahref="#SP18">§18. Access functions</a></li><li><ahref="#SP19">§19. Definition of white space</a></li><li><ahref="#SP20">§20. Internal lexer states</a></li><li><ahref="#SP24">§24. Feeding the lexer</a></li><li><ahref="#SP26">§26. Lexing one character at a time</a></li><li><ahref="#SP26_1">§26.1. Dealing with whitespace</a></li><li><ahref="#SP26_5">§26.5. Completing a word</a></li><li><ahref="#SP26_6">§26.6. Entering and leaving literal mode</a></li><li><ahref="#SP26_8">§26.8. Breaking strings up at text substitutions</a></li><li><ahref="#SP28">§28. Splicing</a></li><li><ahref="#SP29">§29. Basic command-line error handler</a></li></ul><hrclass="tocbar">
<spanclass="reserved">struct</span><spanclass="plain"></span><spanclass="reserved">source_file</span><spanclass="plain"> *</span><spanclass="identifier">file_of_origin</span><spanclass="plain">; </span><spanclass="comment">or <codeclass="display"><spanclass="extract">NULL</span></code> if internally written and not from a file</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">line_number</span><spanclass="plain">; </span><spanclass="comment">counting upwards from 1 within file (if any)</span>
<pclass="endnote">The structure source_location is accessed in 3/tff, 3/fds and here.</p>
<pclass="inwebparagraph"><aid="SP4"></a><b>§4. </b>A word can be an English word such as <codeclass="display"><spanclass="extract">bedspread</span></code>, or a piece of punctuation
such as <codeclass="display"><spanclass="extract">!</span></code>, or a number such as <codeclass="display"><spanclass="extract">127</span></code>, or a piece of quoted text of arbitrary
size such as <codeclass="display"><spanclass="extract">"I summon up remembrance of things past"</span></code>.
</p>
<pclass="inwebparagraph">The words found are numbered 0, 1, 2, ... in order of being read by
the lexer. The first eight or so words come from the mandatory insertion
text (see Read Source Text.w), then come the words from the primary source
text, then those from the extensions loaded.
</p>
<pclass="inwebparagraph">References to text throughout NI's data structure are often in the form
of a pair of word numbers, usually called <codeclass="display"><spanclass="extract">w1</span></code> and <codeclass="display"><spanclass="extract">w2</span></code> or some variation
on that, indicating the text which starts at word <codeclass="display"><spanclass="extract">w1</span></code> and finishes
at <codeclass="display"><spanclass="extract">w2</span></code> (including both ends). Thus if the text is
</p>
<blockquote>
<p>When to the sessions of sweet silent thought</p>
</blockquote>
<pclass="inwebparagraph">then the eight words are numbered 0 to 7 and a reference to <codeclass="display"><spanclass="extract">w1=2</span></code>, <codeclass="display"><spanclass="extract">w2=5</span></code>
would mean the sub-text "the sessions of sweet". The special null value
<codeclass="display"><spanclass="extract">wn=-1</span></code> is used when no word reference has been made: never 0, as that would
mean the first word in the list. The maximum legal word number is always one
less than the following variable's value.
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lexer_wordcount</span><spanclass="plain">; </span><spanclass="comment">Number of words read in to arrays</span>
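<pclass="inwebparagraph">As a minimal sketch of the word-range convention described in §4, the helpers below treat <codeclass="display"><spanclass="extract">w1</span></code> and <codeclass="display"><spanclass="extract">w2</span></code> as an inclusive pair and <codeclass="display"><spanclass="extract">-1</span></code> as the null value. The names <codeclass="display"><spanclass="extract">NO_WORD</span></code>, <codeclass="display"><spanclass="extract">word_range_length</span></code> and <codeclass="display"><spanclass="extract">word_number_is_legal</span></code> are hypothetical and are not part of NI.
</p>

```c
#define NO_WORD (-1) /* hypothetical name for the null value wn = -1 */

/* Number of words in the inclusive range w1..w2, or 0 if either end
   is the null value or the range is empty. */
int word_range_length(int w1, int w2) {
    if (w1 == NO_WORD || w2 == NO_WORD || w2 < w1) return 0;
    return w2 - w1 + 1;
}

/* Nonzero if wn is a legal word number when wordcount words have been
   read: words are numbered from 0, so the largest is wordcount - 1. */
int word_number_is_legal(int wn, int wordcount) {
    return (wn >= 0) && (wn < wordcount);
}
```

<pclass="inwebparagraph">For the sonnet line above, <codeclass="display"><spanclass="extract">word_range_length(2, 5)</span></code> gives 4, matching the four words of "the sessions of sweet".
</p>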
<pclass="inwebparagraph"><aid="SP5"></a><b>§5. The lexical structure of source text. </b>The following definitions are fairly self-evident: they specify which
characters cause word divisions, or signal literals.
<spanclass="definitionkeyword">define</span><spanclass="constant">STRING_BEGIN</span><spanclass="plain"></span><spanclass="character">'"'</span><spanclass="plain"></span><spanclass="comment">Strings are always double-quoted</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">TEXT_SUBSTITUTION_BEGIN</span><spanclass="plain"></span><spanclass="character">'['</span><spanclass="plain"></span><spanclass="comment">Inside strings, this denotes a text substitution</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">COMMENT_BEGIN</span><spanclass="plain"></span><spanclass="character">'['</span><spanclass="plain"></span><spanclass="comment">Text between these, outside strings, is comment</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">INFORM6_ESCAPE_BEGIN_1</span><spanclass="plain"></span><spanclass="character">'('</span><spanclass="plain"></span><spanclass="comment">Text beginning with this pair is literal I6 code</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">PARAGRAPH_BREAK</span><spanclass="plain"></span><spanclass="identifier">L</span><spanclass="string">"|__"</span><spanclass="plain"></span><spanclass="comment">Inserted as a special word to mark paragraph breaks</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">UNICODE_CHAR_IN_STRING</span><spanclass="plain"> ((</span><spanclass="identifier">wchar_t</span><spanclass="plain">) </span><spanclass="constant">0x1b</span><spanclass="plain">) </span><spanclass="comment">To represent awkward characters in metadata only</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">STANDARD_PUNCTUATION_MARKS</span><spanclass="plain"></span><spanclass="identifier">L</span><spanclass="string">".,:;?!(){}[]"</span><spanclass="plain"></span><spanclass="comment">Do not add to this list lightly!</span>
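<pclass="inwebparagraph">A membership test against the punctuation string might look something like the following. This is only a sketch: the function name <codeclass="display"><spanclass="extract">is_standard_punctuation</span></code> is hypothetical, and the real lexer may well decide the question differently.
</p>

```c
#include <wchar.h>

#define STANDARD_PUNCTUATION_MARKS L".,:;?!(){}[]"

/* Hypothetical helper: nonzero if c is one of the standard marks. */
int is_standard_punctuation(int c) {
    if (c <= 0) return 0; /* excludes EOF and the string terminator */
    return wcschr(STANDARD_PUNCTUATION_MARKS, (wchar_t) c) != NULL;
}
```

<pclass="inwebparagraph">The guard on non-positive values matters because <codeclass="display"><spanclass="extract">wcschr</span></code> would otherwise match the terminating null of the string.
</p>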
<pclass="inwebparagraph"><aid="SP7"></a><b>§7. </b>This seems a good point to describe how best to syntax-colour source
text, something which the user interfaces do on every platform. By
convention we are sparing with the colours: ordinary word-processing
is not a kaleidoscopic experience (even when Microsoft Word's impertinent
grammar checker is accidentally left switched on), and we want the experience
of writing Inform source text to be like writing, not like programming.
So we use just a little colour, and that goes a long way.
</p>
<pclass="inwebparagraph">Because the Inform applications generally syntax-colour source text in the
Source panel of the user interface, it is probably worth writing down the
lexical specification. There are eight basic categories of text, and
they should be detected in the following order, with the first category
that applies being the one to determine the colour and/or font weight:
</p>
<pclass="inwebparagraph"></p>
<ulclass="items"><li>(1) Titling text (primary source text only: not found in extensions).
If the first non-whitespace in the file is a double-quoted text (see (4a)),
this is the title of the work.
</li></ul>
<ulclass="items"><li>(2) Documentation text (extension text only: not found in primary source).
If a paragraph consists of a single non-whitespace token only, and that
token is <codeclass="display"><spanclass="extract">----</span></code> (four hyphens in a row), then this paragraph and all
subsequent text down to the bottom of the file are documentation text.
</li></ul>
<ulclass="items"><li>(3) Heading text. If a paragraph consists of a single line only, and that line
begins with one of the five words Volume, Book, Part, Chapter or Section,
capitalised as here, then that paragraph is a heading. (A paragraph
division is found at the start and end of a file, and also at any run
of white space containing two or more newline characters: a newline
can be any of the Unicode characters <codeclass="display"><spanclass="extract">0x000A</span></code>, <codeclass="display"><spanclass="extract">0x2028</span></code> or <codeclass="display"><spanclass="extract">0x2029</span></code>.)
</li></ul>
<ulclass="items"><li>(4a) Quoted text. Outside of (4b) and (4c), a double-quotation mark
(in principle any of Unicode <codeclass="display"><spanclass="extract">0x0022</span></code>, <codeclass="display"><spanclass="extract">0x201C</span></code>, <codeclass="display"><spanclass="extract">0x201D</span></code>) begins
quoted text provided it follows either whitespace, or the start of
the file, or one of the punctuation marks in the <codeclass="display"><spanclass="extract">STANDARD_PUNCTUATION_MARKS</span></code>
string defined above. Quoted text continues until the next
double-quotation mark (or the end of the file if there isn't one,
though NI would issue Problems if asked to compile this).
</li></ul>
<ulclass="items"><li>(4a1) Text substitution text. Within (4a) only, an open square bracket
introduces text substitution matter which continues until the next
close square bracket or the end of the quoted text. (Again, NI would
issue problem messages if given a string malformed in this way.)
</li></ul>
<ulclass="items"><li>(4b) Comment text. Outside of (4a) and (4c), an open square bracket begins
comment. Comment continues until the next matching close square
bracket. (This is the case even if that is in double quotes within the
comment, i.e., quotation marks should be ignored when matching <codeclass="display"><spanclass="extract">[</span></code> and <codeclass="display"><spanclass="extract">]</span></code>
inside a comment.) Thus, nested comments are allowed, and the following
text contains a single comment running from just after "the" through to
the full stop:
</li></ul>
<blockquote>
<p>Snow White and the [Seven Dwarfs [but not Doc]].</p>
</blockquote>
<ulclass="items"><li>(4c) Literal I6 code. Outside of (4a) and (4b), the combination <codeclass="display"><spanclass="extract">(-</span></code> begins
literal I6 matter. This matter continues until the next <codeclass="display"><spanclass="extract">-)</span></code> is reached.
Within literal I6 matter, one can escape back into I7 source text using a
matched pair of <codeclass="display"><spanclass="extract">(+</span></code> and <codeclass="display"><spanclass="extract">+)</span></code> tokens, but it really doesn't seem worth
syntax colouring this very much. And the authors of Inform will lose no
sleep if we miscolour this, for instance, especially if it deters people
from such horrible coding practices:
</li></ul>
<blockquote>
<p>(- Constant BLOB = (+ the total weight of things in (- selfobj -) +); -)</p>
</blockquote>
<ulclass="items"><li>(5) Normal text. Everything else.
</li></ul>
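<pclass="inwebparagraph">The bracket-matching rule for comments in category (4b) can be sketched as a simple depth counter. The function name <codeclass="display"><spanclass="extract">comment_end</span></code> is hypothetical, and the sketch assumes the caller has already established that the opening bracket lies outside quoted text and literal I6 matter.
</p>

```c
#include <stddef.h>

/* Given text and the index of an opening '[' which begins a comment,
   return the index of the matching ']', or -1 if the comment is never
   closed. Quotation marks are deliberately ignored while matching. */
long comment_end(const char *text, size_t open) {
    int depth = 0;
    for (size_t i = open; text[i] != '\0'; i++) {
        if (text[i] == '[') {
            depth++;
        } else if (text[i] == ']') {
            depth--;
            if (depth == 0) return (long) i;
        }
    }
    return -1;
}
```

<pclass="inwebparagraph">Applied to the Snow White example, with the first bracket at index 19, this finds the second of the two closing brackets, immediately before the full stop.
</p>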
<pclass="inwebparagraph">NI regards all of the Unicode characters <codeclass="display"><spanclass="extract">0x0009</span></code>, <codeclass="display"><spanclass="extract">0x000A</span></code>, <codeclass="display"><spanclass="extract">0x000D</span></code>,
<codeclass="display"><spanclass="extract">0x0020</span></code>, <codeclass="display"><spanclass="extract">0x0085</span></code>, <codeclass="display"><spanclass="extract">0x00A0</span></code>, <codeclass="display"><spanclass="extract">0x2000</span></code> to <codeclass="display"><spanclass="extract">0x200A</span></code>, <codeclass="display"><spanclass="extract">0x2028</span></code> and <codeclass="display"><spanclass="extract">0x2029</span></code>
as instances of white space. Of course, it's entirely open to the Inform
user interfaces to not allow the user to key some of these codes, but
we should bear in mind that projects using them might be created on one
platform and then reopened on another one, so it's probably best to be
careful.
</p>
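<pclass="inwebparagraph">For a user interface implementing this specification, the white space and newline sets given in §7 could be tested roughly as follows. Both function names are hypothetical; only the code points themselves come from the text.
</p>

```c
/* Nonzero if the Unicode code point c is one NI regards as white space. */
int is_ni_whitespace(int c) {
    switch (c) {
        case 0x0009: case 0x000A: case 0x000D:
        case 0x0020: case 0x0085: case 0x00A0:
        case 0x2028: case 0x2029:
            return 1;
    }
    return (c >= 0x2000) && (c <= 0x200A);
}

/* Nonzero if c is one of the newline characters listed under category (3). */
int is_ni_newline(int c) {
    return (c == 0x000A) || (c == 0x2028) || (c == 0x2029);
}
```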
<pclass="inwebparagraph"><aid="SP8"></a><b>§8. </b>These categories of text are conventionally displayed as follows:
</p>
<pclass="inwebparagraph"></p>
<ulclass="items"><li>(1) Titling text: black boldface.
</li></ul>
<ulclass="items"><li>(3) Heading text: black boldface, perhaps of a slightly larger point
size.
</li></ul>
<ulclass="items"><li>(4a) Quoted text: dark blue boldface.
</li></ul>
<ulclass="items"><li>(4a1) Text substitution text: lighter blue and not boldface.
</li></ul>
<ulclass="items"><li>(4b) Comment text: darkish green type, perhaps of a slightly smaller point
size.
</li></ul>
<ulclass="items"><li>(4c) Literal I6 code: grey type. (Inform for OS X rather coolly goes into
I6 syntax-colouring, which is considerably harder, for this material:
see "The Inform 6 Technical Manual" for an algorithm.)
</li></ul>
<ulclass="items"><li>(5) Normal text: black type.
</li></ul>
<pclass="inwebparagraph"><aid="SP9"></a><b>§9. What the lexer stores for each word. </b>The lexer builds a small data structure for each individual word it reads.
<spanclass="identifier">wchar_t</span><spanclass="plain"> *</span><spanclass="identifier">lw_text</span><spanclass="plain">; </span><spanclass="comment">text of word after treatment to normalise</span>
<spanclass="identifier">wchar_t</span><spanclass="plain"> *</span><spanclass="identifier">lw_rawtext</span><spanclass="plain">; </span><spanclass="comment">original untouched text of word</span>
<spanclass="reserved">struct</span><spanclass="plain"></span><spanclass="reserved">source_location</span><spanclass="plain"></span><spanclass="identifier">lw_source</span><spanclass="plain">; </span><spanclass="comment">where it was read from</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lexer_details_memory_allocated</span><spanclass="plain"> = </span><spanclass="constant">0</span><spanclass="plain">; </span><spanclass="comment">bytes allocated to this array</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lexer_workspace_allocated</span><spanclass="plain"> = </span><spanclass="constant">0</span><spanclass="plain">; </span><spanclass="comment">bytes allocated to text storage</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">MAX_VERBATIM_LENGTH</span><spanclass="plain"></span><spanclass="constant">200000</span><spanclass="plain"></span><spanclass="comment">Largest quantity of Inform 6 which can be quoted verbatim.</span>
<spanclass="definitionkeyword">define</span><spanclass="constant">MAX_WORD_LENGTH</span><spanclass="plain"></span><spanclass="constant">128</span><spanclass="plain"></span><spanclass="comment">Maximum length of any unquoted word</span>
<spanclass="identifier">wchar_t</span><spanclass="plain"> *</span><spanclass="identifier">lexer_workspace</span><spanclass="plain">; </span><spanclass="comment">Large area of contiguous memory for text</span>
<spanclass="identifier">wchar_t</span><spanclass="plain"> *</span><spanclass="identifier">lexer_word</span><spanclass="plain">; </span><spanclass="comment">Start of current word in workspace</span>
<spanclass="identifier">wchar_t</span><spanclass="plain"> *</span><spanclass="identifier">lexer_hwm</span><spanclass="plain">; </span><spanclass="comment">High water mark of workspace</span>
<spanclass="identifier">wchar_t</span><spanclass="plain"> *</span><spanclass="identifier">lexer_workspace_end</span><spanclass="plain">; </span><spanclass="comment">Pointer to just past the end of the workspace: HWM must not exceed this</span>
<spanclass="functiontext">Lexer::ensure_space_up_to</span><spanclass="plain">(50000); </span><spanclass="comment">the Standard Rules are about 44,000 words</span>
<pclass="endnote">The function Lexer::ensure_space_up_to is used in <ahref="#SP11">§11</a>, <ahref="#SP26_5_2">§26.5.2</a>, <ahref="#SP28">§28</a>.</p>
<pclass="inwebparagraph"><aid="SP13"></a><b>§13. </b>Inform would almost certainly crash if we wrote past the end of the
workspace, so we need to watch for the water running high. The following
routine checks that there is room for another <codeclass="display"><spanclass="extract">n</span></code> characters, plus a
termination character, plus breathing space for a single character's worth
<pclass="endnote">The function Lexer::copy_to_memory is used in 4/nw (<ahref="4-nw.html#SP8">§8</a>).</p>
<pclass="inwebparagraph"><aid="SP15"></a><b>§15. External lexer states. </b>The lexer is a finite state machine at heart. Its current state is the
collective value of an extensive set of variables, almost all of them
flags, but with three exceptions this state is used only within the lexer.
</p>
<pclass="inwebparagraph">The three exceptional modes are all off by default, and by default they
stay off: the lexer never goes into any of these modes by itself.
</p>
<pclass="inwebparagraph"><codeclass="display"><spanclass="extract">lexer_divide_strings_at_text_substitutions</span></code> is used by some of the lexical writing-back
machinery, when it has been decided to compile something like
</p>
<blockquote>
<p>say "[The noun] falls onto [the second noun]."</p>
</blockquote>
<pclass="inwebparagraph">In its ordinary mode, with this setting off, the lexer will render this as
two words, the second being the entire quoted text. But if
<codeclass="display"><spanclass="extract">lexer_divide_strings_at_text_substitutions</span></code> is set then the text is reinterpreted as
</p>
<blockquote>
<p>say The noun, " falls onto ", the second noun, "."</p>
</blockquote>
<pclass="inwebparagraph">which runs to eleven words, three of them commas (punctuation always counts
as a word).
</p>
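<pclass="inwebparagraph">The effect of dividing a string at its text substitutions can be illustrated with a small stand-alone routine. This is a sketch of the transformation only, not of how the lexer achieves it: the function name <codeclass="display"><spanclass="extract">divide_at_substitutions</span></code> is hypothetical, it works on narrow characters for brevity, and it takes the contents of the quoted text without the outer quotation marks.
</p>

```c
#include <stdio.h>

/* Rewrite the contents of a quoted text so that each [substitution]
   becomes bare words and each literal run becomes its own quoted text,
   with the pieces separated by commas. If out is too small, the result
   is cut short. */
void divide_at_substitutions(const char *in, char *out, size_t out_size) {
    size_t used = 0;
    int pieces = 0;
    if (out_size == 0) return;
    out[0] = '\0';
    while (*in != '\0') {
        int is_subst = (*in == '[');
        const char *start = is_subst ? in + 1 : in;
        const char *end = start;
        /* a substitution runs to ']', a literal run to the next '[' */
        while (*end != '\0' && *end != (is_subst ? ']' : '[')) end++;
        int n = snprintf(out + used, out_size - used, "%s%s%.*s%s",
            pieces ? ", " : "", is_subst ? "" : "\"",
            (int) (end - start), start, is_subst ? "" : "\"");
        if (n < 0 || (size_t) n >= out_size - used) return; /* no room left */
        used += (size_t) n;
        pieces++;
        in = (is_subst && *end == ']') ? end + 1 : end;
    }
}
```

<pclass="inwebparagraph">Given the contents of the example string above, this produces the text following <codeclass="display"><spanclass="extract">say</span></code> in the reinterpreted form.
</p>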
<pclass="inwebparagraph"><codeclass="display"><spanclass="extract">lexer_wait_for_dashes</span></code> is set by the extension-reading machinery, in
cases where it wants to get at the documentation text of an extension but
does not want to have to fill NI's memory with the source text of its code.
In this mode, the lexer ignores the whole stream of words until it reaches
<codeclass="display"><spanclass="extract">----</span></code>, the special marker used in extensions to divide source text from
documentation: it then drops out of this mode and back into normal running,
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lexer_divide_strings_at_text_substitutions</span><spanclass="plain">; </span><spanclass="comment">Break up text substitutions in quoted text</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lexer_allow_I6_escapes</span><spanclass="plain">; </span><spanclass="comment">Recognise <codeclass="display"><spanclass="extract">(-</span></code> and <codeclass="display"><spanclass="extract">-)</span></code></span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lexer_wait_for_dashes</span><spanclass="plain">; </span><spanclass="comment">Ignore all text until first <codeclass="display"><spanclass="extract">----</span></code> found</span>
<pclass="inwebparagraph"><aid="SP16"></a><b>§16. Definition of punctuation. </b>As we have seen, the question of whether something is a punctuation mark
<pclass="endnote">The function Lexer::is_punctuation is used in <ahref="#SP25">§25</a>, 3/tff (<ahref="3-tff.html#SP4">§4</a>).</p>
<pclass="inwebparagraph"><aid="SP17"></a><b>§17. Definition of indentation. </b>We're going to record the level of indentation in the "break" character.
We will recognise anything from 1 to 25 tabs as distinct indentation amounts;
a value of 26 means "26 or more", and at such sizes, indentation isn't
distinguished. We'll do this with the letters <codeclass="display"><spanclass="extract">A</span></code> to <codeclass="display"><spanclass="extract">Z</span></code>.
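<pclass="inwebparagraph">One way to realise this encoding is sketched below. The function names are hypothetical, and the treatment of a zero tab count (returning a plain newline) is an assumption made for the sake of a complete example, not something stated in the text.
</p>

```c
/* Encode a tab count as a break character: 1 to 25 tabs map to 'A' to
   'Y', and 26 or more tabs all map to 'Z'. */
int break_char_for_tabs(int tabs) {
    if (tabs <= 0) return '\n'; /* assumption: unindented lines keep '\n' */
    if (tabs >= 26) return 'Z';
    return 'A' + tabs - 1;
}

/* Inverse mapping: the indentation level recorded in a break character,
   or 0 if the character is not one of 'A' to 'Z'. */
int tabs_for_break_char(int c) {
    if (c >= 'A' && c <= 'Z') return c - 'A' + 1;
    return 0;
}
```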
<spanclass="reserved">if</span><spanclass="plain"> (</span><spanclass="identifier">wn</span><spanclass="plain"><</span><spanclass="constant">0</span><spanclass="plain">) </span><spanclass="identifier">internal_error</span><spanclass="plain">(</span><spanclass="string">"can't set word location"</span><spanclass="plain">);</span>
<pclass="endnote">The function Lexer::set_word is used in 2/vcb (<ahref="2-vcb.html#SP4">§4</a>).</p>
<pclass="endnote">The function Lexer::break_before is used in 3/wrd (<ahref="3-wrd.html#SP21">§21</a>), 4/nw (<ahref="4-nw.html#SP4">§4</a>).</p>
<pclass="endnote">The function Lexer::file_of_origin appears nowhere else.</p>
<pclass="endnote">The function Lexer::word_location is used in 3/wrd (<ahref="3-wrd.html#SP11">§11</a>).</p>
<pclass="endnote">The function Lexer::set_word_location appears nowhere else.</p>
<pclass="endnote">The function Lexer::set_word_raw_text is used in 2/vcb (<ahref="2-vcb.html#SP5">§5</a>), 4/nw (<ahref="4-nw.html#SP8">§8</a>).</p>
<pclass="endnote">The function Lexer::word_text is used in 2/vcb (<ahref="2-vcb.html#SP4">§4</a>, <ahref="2-vcb.html#SP7">§7</a>), 3/wrd (<ahref="3-wrd.html#SP17">§17</a>), 3/tff (<ahref="3-tff.html#SP4">§4</a>), 3/idn (<ahref="3-idn.html#SP3">§3</a>), 4/nw (<ahref="4-nw.html#SP2">§2</a>, <ahref="4-nw.html#SP8">§8</a>).</p>
<pclass="endnote">The function Lexer::set_word_text is used in 2/vcb (<ahref="2-vcb.html#SP5">§5</a>), 4/nw (<ahref="4-nw.html#SP8">§8</a>).</p>
<pclass="endnote">The function Lexer::word_copy is used in <ahref="#SP28">§28</a>.</p>
<pclass="inwebparagraph"><aid="SP19"></a><b>§19. Definition of white space. </b>The following macro (to save time over a function call) is highly dangerous,
and of the kind which all books on C counsel against. If it were called with
any argument whose evaluation had side-effects, disaster would ensue.
It is therefore used only twice, with care, and only in this section below.
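<pclass="inwebparagraph">To see why such a macro is dangerous, consider the following sketch. The macro <codeclass="display"><spanclass="extract">IS_SPACE_MACRO</span></code> is a hypothetical stand-in written in the same style, not the lexer's own definition. Because the argument appears three times in the expansion, an argument such as <codeclass="display"><spanclass="extract">*p++</span></code> may be evaluated up to three times.
</p>

```c
/* Hypothetical macro in the style the text warns about: the argument
   expression is written out, and so may be evaluated, up to three times. */
#define IS_SPACE_MACRO(c) (((c) == ' ') || ((c) == '\t') || ((c) == '\n'))

/* The function form evaluates its argument exactly once. */
int is_space_function(int c) {
    return (c == ' ') || (c == '\t') || (c == '\n');
}

/* How many characters does one macro call consume when its argument has
   a side effect? The string s must hold at least three characters. */
int chars_consumed_by_macro(const char *s) {
    const char *p = s;
    if (IS_SPACE_MACRO(*p++)) { /* result deliberately ignored */ }
    return (int) (p - s);
}

/* The same experiment with the function form. */
int chars_consumed_by_function(const char *s) {
    const char *p = s;
    if (is_space_function(*p++)) { /* result deliberately ignored */ }
    return (int) (p - s);
}
```

<pclass="inwebparagraph">On the string <codeclass="display"><spanclass="extract">"ab\n"</span></code> the macro version advances the pointer three places and reports white space, even though the first character is a letter, whereas the function version advances it once.
</p>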
<pclass="inwebparagraph"><aid="SP20"></a><b>§20. Internal lexer states. </b>The current situation of the lexer is specified by the collective values
of all of the following. First, the start of the current word being
recorded, and the current high water mark — those are defined above.
Second, we need the feeder machinery to maintain a variable telling us
the previous character in the raw, un-respaced source. We need to be a
little careful about the type of this: it needs to be an <codeclass="display"><spanclass="extract">int</span></code> so that it
can on occasion hold the pseudo-character value <codeclass="display"><spanclass="extract">EOF</span></code>.
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_previous_char_in_raw_feed</span><spanclass="plain">; </span><spanclass="comment">Preceding character in raw file read</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_kind_of_word</span><spanclass="plain">; </span><spanclass="comment">One of the defined values above</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_literal_mode</span><spanclass="plain">; </span><spanclass="comment">Are we in literal or ordinary mode?</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_most_significant_space_char</span><spanclass="plain">; </span><spanclass="comment">Most significant whitespace character preceding</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_number_of_tab_stops</span><spanclass="plain">; </span><spanclass="comment">Number of consecutive tabs</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_this_line_is_empty_so_far</span><spanclass="plain">; </span><spanclass="comment">Current line white space so far?</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_this_word_is_empty_so_far</span><spanclass="plain">; </span><spanclass="comment">Looking for a word to start?</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_scanning_text_substitution</span><spanclass="plain">; </span><spanclass="comment">Used to break up strings at [substitutions]</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_comment_nesting</span><spanclass="plain">; </span><spanclass="comment">For square brackets within square brackets</span>
<spanclass="reserved">int</span><spanclass="plain"></span><spanclass="identifier">lxs_string_soak_up_spaces_mode</span><spanclass="plain">; </span><spanclass="comment">Used to fold strings which break across lines</span>
<spanclass="identifier">lxs_most_significant_space_char</span><spanclass="plain"> = </span><spanclass="character">'\n'</span><spanclass="plain">; </span><spanclass="comment">we imagine each lexer feed starting a new line</span>
<spanclass="identifier">lxs_number_of_tab_stops</span><spanclass="plain"> = </span><spanclass="constant">0</span><spanclass="plain">; </span><spanclass="comment">but not yet indented with tabs</span>
<spanclass="identifier">LOGIF</span><spanclass="plain">(</span><spanclass="identifier">LEXICAL_OUTPUT</span><spanclass="plain">, </span><spanclass="string">"Lexer feed began at %d\n"</span><spanclass="plain">, </span><spanclass="identifier">lexer_feed_started_at</span><spanclass="plain">);</span>
<<spanclass="cwebmacro">Issue Problem messages if feed ended in the middle of quoted text, comment or verbatim I6</span><spanclass="cwebmacronumber">24.3</span>><spanclass="plain">;</span>
<pclass="endnote">The function Lexer::feed_begins is used in 3/tff (<ahref="3-tff.html#SP2">§2</a>), 3/fds (<ahref="3-fds.html#SP5">§5</a>).</p>
<pclass="endnote">The function Lexer::feed_ends is used in 3/tff (<ahref="3-tff.html#SP2">§2</a>), 3/fds (<ahref="3-fds.html#SP5">§5</a>).</p>
<pclass="inwebparagraph"><aid="SP24_1"></a><b>§24.1. </b>White space padding guarantees that a word running right up to the end of
the feed will be processed, since (outside literal mode) that white space
signals to the lexer that a word is complete. (If we are in literal mode at
the end of the feed, problem messages are produced. We code NI to ensure
that this never occurs when feeding our own C strings through.)
</p>
<pclass="inwebparagraph">At the end of each complete file, we also want to ensure there is always a
paragraph break, because this simplifies the parsing of headings (which in
turn is because a file boundary counts as a super-heading-break, and headings
are only detected as stand-alone paragraphs). We add a bit more white
space than is strictly necessary, because it saves worrying about whether
it is safe to look ahead to characters further on in the lexer's workspace
when we are close to the high water mark, and because it means that a source
file which is empty or contains only a byte-order marker comes out as at
least one paragraph, even if a blank one.
</p>
<pclass="macrodefinition"><codeclass="display">
<<spanclass="cwebmacrodefn">Feed whitespace as padding</span><spanclass="cwebmacronumber">24.1</span>> =
<pclass="endnote">This code is used in <ahref="#SP24">§24</a>.</p>
<pclass="inwebparagraph"><aid="SP24_2"></a><b>§24.2. </b>These problem messages can, of course, never result from text which NI
is feeding into the lexer itself, independently of source files. That would
be a bug, and NI is bug-free, so it follows that it could never happen.
</p>
<preclass="definitions">
<spanclass="definitionkeyword">enum</span><spanclass="constant">MEMORY_OUT_LEXERERROR</span><spanclass="definitionkeyword"> from </span><spanclass="constant">0</span>
<<spanclass="cwebmacrodefn">Issue Problem messages if feed ended in the middle of quoted text, comment or verbatim I6</span><spanclass="cwebmacronumber">24.3</span>> =
<pclass="endnote">This code is used in <ahref="#SP24">§24</a>.</p>
<pclass="inwebparagraph"><aid="SP25"></a><b>§25. </b>The feeder routine is required to send us a triple each time: <codeclass="display"><spanclass="extract">cr</span></code>
must be a valid character (see above) and may not be <codeclass="display"><spanclass="extract">EOF</span></code>; <codeclass="display"><spanclass="extract">last_cr</span></code> must
be the previous one or else perhaps <codeclass="display"><spanclass="extract">EOF</span></code> at the start of feed;
while <codeclass="display"><spanclass="extract">next_cr</span></code> must be the next or else perhaps <codeclass="display"><spanclass="extract">EOF</span></code> at the end of feed.
</p>
<pclass="inwebparagraph">Spaces, often redundant, are inserted around punctuation unless one of the
following exceptions holds:
</p>
<pclass="inwebparagraph">The lexer is in literal mode (inside strings, for instance);
</p>
<pclass="inwebparagraph">Where a single punctuation mark occurs in between two digits, or between
a digit and a minus sign, or (in the case of full stops) between two lower-case
alphanumeric characters. This is done so that, for instance, "0.91" does
not split into three words in the lexer. We do not count square brackets
here, because if we did, that would cause trouble in parsing
</p>
<blockquote>
<p>say "[if M is less than 10]0[otherwise]1";</p>
</blockquote>
<pclass="inwebparagraph">where the <codeclass="display"><spanclass="extract">0]0</span></code> would go unbroken in <codeclass="display"><spanclass="extract">lexer_divide_strings_at_text_substitutions</span></code>
mode, and therefore the <codeclass="display"><spanclass="extract">]</span></code> would remain glued to the preceding text;
</p>
<pclass="inwebparagraph">Where the character following is a slash. (This is done essentially to make
<spanclass="functiontext">Lexer::feed_char_into_lexer</span><spanclass="plain">(</span><spanclass="identifier">cr</span><spanclass="plain">); </span><spanclass="comment">which might take us into literal mode, so to be careful...</span>
<pclass="endnote">The function Lexer::feed_triplet is used in 3/tff (<ahref="3-tff.html#SP2">§2</a>), 3/fds (<ahref="3-fds.html#SP5">§5</a>).</p>
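<pclass="inwebparagraph">The exceptions listed in §25 can be sketched as a single decision function. This is an illustration under stated assumptions rather than the feeder's actual code: the function names are hypothetical, and "between a digit and a minus sign" is read here as a digit before the mark and a minus sign after it, which is only one possible reading.
</p>

```c
#include <wchar.h>

static int is_digit(int c) { return c >= '0' && c <= '9'; }
static int is_lower_alnum(int c) {
    return is_digit(c) || (c >= 'a' && c <= 'z');
}

/* Returns nonzero if spaces should be inserted around the character cr,
   given its neighbours in the raw feed and whether the lexer is in
   literal mode. */
int should_space_around(int last_cr, int cr, int next_cr, int literal_mode) {
    if (cr <= 0 || wcschr(L".,:;?!(){}[]", (wchar_t) cr) == NULL)
        return 0;                        /* not a punctuation mark at all */
    if (literal_mode) return 0;          /* e.g. inside a string */
    if (next_cr == '/') return 0;        /* a slash follows */
    if (cr != '[' && cr != ']') {        /* square brackets do not count */
        if (is_digit(last_cr) && (is_digit(next_cr) || next_cr == '-'))
            return 0;                    /* e.g. the stop in 0.91 */
        if (cr == '.' && is_lower_alnum(last_cr) && is_lower_alnum(next_cr))
            return 0;                    /* full stop inside a lower-case run */
    }
    return 1;
}
```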
<pclass="inwebparagraph"><aid="SP26"></a><b>§26. Lexing one character at a time. </b>We can think of characters as a stream of differently-coloured marbles,
flowing from various sources into a hopper above our marble-sorting
machine. The hopper lets the marbles drop through one at a time into the
mechanism below, but inserts transparent glass marbles of its own on either
side of certain colours of marble, so that the sequence of marbles entering
the mechanism is no longer the same as that which entered the hopper.
Moreover, the mechanism can itself cause extra marbles of its choice to
drop in from time to time, further interrupting the original flow.
</p>
<pclass="inwebparagraph">The following routine is the mechanism which receives the marbles. We want
the marbles to run swiftly through and either be pulverised to glass
powder, or dropped into the output bucket, as the mechanism chooses.
(Whatever marbles from the original source survive will always emerge in
their original order, though.) Every so often the mechanism decides that it
has completed one batch, and moves on to dropping marbles into the next
bucket.
</p>
<pclass="inwebparagraph">The marbles are characters; transparent glass ones are whitespace, which
will always now be <codeclass="display"><spanclass="extract">' '</span></code>, <codeclass="display"><spanclass="extract">'\t'</span></code> or <codeclass="display"><spanclass="extract">'\n'</span></code>; the routine
<codeclass="display"><spanclass="extract">Lexer::feed_triplet</span></code> above was the hopper; the routine
<codeclass="display"><spanclass="extract">Lexer::feed_char_into_lexer</span></code>, which occupies the whole of the rest of this
section, is the mechanism which takes each marble in turn. (On occasion it
calls itself recursively to cause extra characters of its choice to drop
in.) The batches are words, and the bucket receiving the surviving marbles
is the sequence of characters starting at <codeclass="display"><spanclass="extract">lexer_word</span></code> and extending to
<<spanclass="cwebmacro">Force string division at the start of a text substitution, if necessary</span><spanclass="cwebmacronumber">26.8</span>><spanclass="plain">;</span>
<<spanclass="cwebmacro">Soak up whitespace around line breaks inside a literal string</span><spanclass="cwebmacronumber">26.4</span>><spanclass="plain">;</span>
<spanclass="plain">}</span>
<spanclass="plain">}</span>
<spanclass="comment">whitespace outside literal mode ends any partly built word and need not be recorded</span>
<<spanclass="cwebmacro">Admire the texture of the whitespace</span><spanclass="cwebmacronumber">26.1</span>><spanclass="plain">;</span>
<spanclass="reserved">if</span><spanclass="plain"> (</span><spanclass="identifier">lexer_word</span><spanclass="plain"> != </span><spanclass="identifier">lexer_hwm</span><spanclass="plain">) </span><<spanclass="cwebmacro">Complete the current word</span><spanclass="cwebmacronumber">26.5</span>><spanclass="plain">;</span>
<<spanclass="cwebmacro">Force string division at the end of a text substitution, if necessary</span><spanclass="cwebmacronumber">26.9</span>><spanclass="plain">;</span>
<<spanclass="cwebmacro">Look at recent whitespace to see what break it followed</span><spanclass="cwebmacronumber">26.2</span>><spanclass="plain">;</span>
<pclass="endnote">The function Lexer::feed_char_into_lexer is used in <ahref="#SP24_1">§24.1</a>, <ahref="#SP25">§25</a>, <ahref="#SP26_3">§26.3</a>, <ahref="#SP26_8">§26.8</a>, <ahref="#SP26_9">§26.9</a>.</p>
<pclass="inwebparagraph"><aid="SP26_1"></a><b>§26.1. Dealing with whitespace. </b>Let's deal with the different textures of whitespace first, as these are
surprisingly rich all by themselves.
</p>
<pclass="inwebparagraph">The following keeps track of the most significant whitespace character seen
in the current run, ranking newlines above tabs, and tabs in turn above
spaces; it also counts the number of tabs it has seen (resetting the count
to zero if a newline is found).
</p>
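<pclass="inwebparagraph">As a rough illustration of that bookkeeping, here is a minimal sketch. The
struct and function names are hypothetical stand-ins, not the module's own code; only the ranking
rule and the tab-count reset are taken from the description above.
</p>

```c
#include <assert.h>

/* Hypothetical stand-in for the lexer's whitespace state. The fields echo
   lxs_most_significant_space_char and lxs_number_of_tab_stops, but this
   struct is an assumption made for illustration only. */
typedef struct {
    char most_significant; /* ' ', '\t' or '\n' */
    int tab_stops;         /* tabs seen since the last newline */
} space_texture;

/* Rank whitespace: newline beats tab, and tab beats space. */
static int space_rank(char c) {
    switch (c) {
        case '\n': return 2;
        case '\t': return 1;
        default:   return 0;
    }
}

/* Start a fresh run of whitespace. */
static void texture_reset(space_texture *t) {
    t->most_significant = ' ';
    t->tab_stops = 0;
}

/* Take note of one whitespace character in the current run. */
static void texture_admire(space_texture *t, char c) {
    assert(c == ' ' || c == '\t' || c == '\n');
    if (c == '\n') t->tab_stops = 0; /* a newline cancels the tab count */
    if (c == '\t') t->tab_stops++;
    if (space_rank(c) > space_rank(t->most_significant))
        t->most_significant = c;
}
```

<pclass="inwebparagraph">Feeding this sketch a tab, a newline, two tabs and a space leaves the newline
as the most significant character and a tab count of 2, which is the shape of "newline followed by
1 or more tabs" that indentation detection looks for.
</p>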
<pclass="macrodefinition"><codeclass="display">
<<spanclass="cwebmacrodefn">Admire the texture of the whitespace</span><spanclass="cwebmacronumber">26.1</span>> =
<spanclass="functiontext">Lexer::break_char_for_indents</span><spanclass="plain">(</span><spanclass="identifier">lxs_number_of_tab_stops</span><spanclass="plain">); </span><spanclass="comment">newline followed by 1 or more tabs</span>
<spanclass="identifier">lxs_most_significant_space_char</span><spanclass="plain"> = </span><spanclass="character">' '</span><spanclass="plain">; </span><spanclass="comment">waiting for the next run of whitespace, after this word</span>
<pclass="endnote">This code is used in <ahref="#SP26">§26</a>.</p>
<pclass="inwebparagraph"><aid="SP26_5"></a><b>§26.5. Completing a word. </b>Outside of whitespace, then, our word (whatever it was — ordinary word,
literal string, I6 insertion or comment) has been stored character by
character at the steadily rising high water mark. We have now hit the end
by reaching whitespace (in the case of a literal, this has happened because
we found the end of the literal, escaped literal mode, and then hit
whitespace). The start of the word is at <codeclass="display"><spanclass="extract">lexer_word</span></code>; the last character
is stored just below <codeclass="display"><spanclass="extract">lexer_hwm</span></code>.
</p>
<pclass="macrodefinition"><codeclass="display">
<<spanclass="cwebmacrodefn">Complete the current word</span><spanclass="cwebmacronumber">26.5</span>> =
<spanclass="plain">*</span><spanclass="identifier">lexer_hwm</span><spanclass="plain">++ = </span><spanclass="constant">0</span><spanclass="plain">; </span><spanclass="comment">terminate the current word as a C string</span>
<spanclass="identifier">lexer_wait_for_dashes</span><spanclass="plain"> = </span><spanclass="identifier">FALSE</span><spanclass="plain">; </span><spanclass="comment">our long wait for documentation is over</span>
<<spanclass="cwebmacro">Issue problem message and truncate if over maximum length for what it is</span><spanclass="cwebmacronumber">26.5.1</span>><spanclass="plain">;</span>
<<spanclass="cwebmacro">Store everything about the word except its break, which we already know</span><spanclass="cwebmacronumber">26.5.2</span>><spanclass="plain">;</span>
<spanclass="plain">}</span>
<spanclass="comment">now get ready for what we expect by default to be an ordinary word next</span>
<<spanclass="cwebmacrodefn">Issue problem message and truncate if over maximum length for what it is</span><spanclass="cwebmacronumber">26.5.1</span>> =
<spanclass="identifier">lexer_word</span><spanclass="plain">[</span><spanclass="identifier">max_len</span><spanclass="plain">] = </span><spanclass="constant">0</span><spanclass="plain">; </span><spanclass="comment">truncate to its maximum length</span>
<spanclass="identifier">lexer_word</span><spanclass="plain">[100] = </span><spanclass="constant">0</span><spanclass="plain">; </span><spanclass="comment">to avoid an absurdly long problem message</span>
<pclass="endnote">This code is used in <ahref="#SP26_5">§26.5</a>.</p>
<pclass="inwebparagraph"><aid="SP26_5_2"></a><b>§26.5.2. </b>We recorded the break for the word when it started (recall that, even if
the current word is a literal, its first character was read outside literal
mode, so it started out in life as an ordinary word and therefore had its
break recorded). So now we need to set everything else about it, and to
increment the word-count. We must not allow this to reach its maximum,
since this would allow the next word's break setting to overwrite the
array.
</p>
<pclass="inwebparagraph">For ordinary words (but not literals), the copy of a word in the main array
<codeclass="display"><spanclass="extract">lw_text</span></code> is lowered in case. The original is preserved in <codeclass="display"><spanclass="extract">lw_rawtext</span></code> and
is used to print more attractive error messages, and also to enable a few
semantic parts of NI to be case sensitive. This copying means that in the
worst case — when we complete an ordinary word of maximal length — we need
to consume an additional <codeclass="display"><spanclass="extract">MAX_WORD_LENGTH+2</span></code> bytes of the lexer's workspace,
which is why that was the amount we checked to ensure existed when the
lexer was called. The lowering loop can therefore never overspill the
workspace.
</p>
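<pclass="inwebparagraph">The lowering copy can be pictured with the following sketch. It assumes a
wide-character workspace and uses a hypothetical helper, <codeclass="display"><spanclass="extract">lower_copy</span></code>, which is
not a function of this module; the real loop works directly on the lexer's own variables.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: copy the zero-terminated word at `raw` to the high
   water mark `hwm`, lowering its case, and return the new high water mark.
   The original at `raw` is left untouched, so it can still be used for
   error messages and case-sensitive checks. */
static wchar_t *lower_copy(const wchar_t *raw, wchar_t *hwm,
                           const wchar_t *workspace_end) {
    size_t len = wcslen(raw);
    /* The caller is assumed to have reserved room in advance, as the
       lexer does before it begins; this assert only documents that. */
    assert((size_t) (workspace_end - hwm) >= len + 1);
    for (size_t i = 0; i < len; i++)
        hwm[i] = (wchar_t) towlower((wint_t) raw[i]);
    hwm[len] = 0; /* terminate the lowered copy */
    return hwm + len + 1;
}
```

<pclass="inwebparagraph">Because the space is checked once up front, the loop itself needs no bounds
test on each character.
</p>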
<pclass="macrodefinition"><codeclass="display">
<<spanclass="cwebmacrodefn">Store everything about the word except its break, which we already know</span><spanclass="cwebmacronumber">26.5.2</span>> =
<pclass="endnote">This code is used in <ahref="#SP26_5">§26.5</a>.</p>
<pclass="inwebparagraph"><aid="SP26_6"></a><b>§26.6. Entering and leaving literal mode. </b>After a character has been stored, in ordinary mode, we see if it
provokes us into entering literal mode, by signifying the start of a
comment, string or passage of verbatim Inform 6.
</p>
<pclass="inwebparagraph">In the case of a string, we positively want to keep the opening character
just recorded as part of the word: it's the opening double-quote mark.
In the case of a comment, we don't care, as we're going to throw it away
anyhow; as it happens, we keep it for now. But in the case of an I6
escape we are in danger, because of the auto-spacing around brackets, of
recording two words
</p>
<blockquote>
<p>|( -something|</p>
</blockquote>
<pclass="inwebparagraph">when in fact we want to record
</p>
<blockquote>
<p>|(- something|</p>
</blockquote>
<pclass="inwebparagraph">We do this by adding a hyphen to the previous word (the <codeclass="display"><spanclass="extract">(</span></code> word), and by
throwing away the hyphen from the material of the current word.
<spanclass="comment">because of spacing around punctuation outside literal mode, the <codeclass="display"><spanclass="extract">(</span></code> became a word</span>
<spanclass="reserved">if</span><spanclass="plain"> (</span><spanclass="identifier">lexer_wordcount</span><spanclass="plain">></span><spanclass="constant">0</span><spanclass="plain">) { </span><spanclass="comment">this should always be true: just being cautious</span>
<spanclass="identifier">lw_array</span><spanclass="plain">[</span><spanclass="identifier">lexer_wordcount</span><spanclass="plain">-1].</span><spanclass="element">lw_text</span><spanclass="plain"> = </span><spanclass="identifier">L</span><spanclass="string">"(-"</span><spanclass="plain">; </span><spanclass="comment">change the previous word's text from <codeclass="display"><spanclass="extract">(</span></code> to <codeclass="display"><spanclass="extract">(-</span></code></span>
<spanclass="plain">*(</span><spanclass="identifier">lexer_hwm</span><spanclass="plain">++) = </span><spanclass="identifier">c</span><spanclass="plain">; </span><spanclass="comment">record the <codeclass="display"><spanclass="extract">STRING_END</span></code> character as part of the word</span>
<spanclass="identifier">lexer_hwm</span><spanclass="plain">--; </span><spanclass="comment">erase the <codeclass="display"><spanclass="extract">INFORM6_ESCAPE_END_1</span></code> character recorded last time</span>
<pclass="endnote">This code is used in <ahref="#SP26">§26</a>.</p>
<pclass="inwebparagraph"><aid="SP26_8"></a><b>§26.8. Breaking strings up at text substitutions. </b>When text contains text substitutions, these are ordinarily ignored by the
lexer, but in <codeclass="display"><spanclass="extract">lexer_divide_strings_at_text_substitutions</span></code> mode, we need to
force strings to end and resume at the two ends of each substitution. For
instance:
</p>
<blockquote>
<p>"Hello, [greeted person]. Do you make it [supper time]?"</p>
</blockquote>
<pclass="inwebparagraph">must be split as
</p>
<blockquote>
<p>|"Hello, " , greeted person , ". Do you make it " , supper time , "?"|</p>
</blockquote>
<pclass="inwebparagraph">where our original single text literal is now three text literals, plus
eight ordinary words (four of them commas).
</p>
<pclass="inwebparagraph">Note that each open square bracket, and each close square bracket, has been
removed and become a comma word. We see to open squares before we come
to recording the character, so to get rid of the <codeclass="display"><spanclass="extract">[</span></code> character, we change
<codeclass="display"><spanclass="extract">c</span></code> to a space:
</p>
<pclass="macrodefinition"><codeclass="display">
<<spanclass="cwebmacrodefn">Force string division at the start of a text substitution, if necessary</span><spanclass="cwebmacronumber">26.8</span>> =
<spanclass="functiontext">Lexer::feed_char_into_lexer</span><spanclass="plain">(</span><spanclass="constant">STRING_END</span><spanclass="plain">); </span><spanclass="comment">feed <codeclass="display"><spanclass="extract">"</span></code> to close the old string</span>
<spanclass="functiontext">Lexer::feed_char_into_lexer</span><spanclass="plain">(</span><spanclass="constant">TEXT_SUBSTITUTION_SEPARATOR</span><spanclass="plain">); </span><spanclass="comment">feed <codeclass="display"><spanclass="extract">,</span></code> to start new word</span>
<spanclass="identifier">c</span><spanclass="plain"> = </span><spanclass="character">' '</span><spanclass="plain">; </span><spanclass="comment">the lexer now goes on to record a space, which will end the <codeclass="display"><spanclass="extract">,</span></code> word</span>
<spanclass="identifier">lxs_scanning_text_substitution</span><spanclass="plain"> = </span><spanclass="identifier">TRUE</span><spanclass="plain">; </span><spanclass="comment">but remember that we must get back again</span>
<pclass="endnote">This code is used in <ahref="#SP26">§26</a>.</p>
<pclass="inwebparagraph"><aid="SP26_9"></a><b>§26.9. </b>Close squares, by contrast, we see to only after recording the character, so we have
to erase it to get rid of the <codeclass="display"><spanclass="extract">]</span></code>. Note that since this was read in ordinary
mode, it was automatically spaced (being punctuation), and that therefore
the feeder above has just sent the second of a sequence of three characters:
space, <codeclass="display"><spanclass="extract">]</span></code>, space. That means we have recorded, so far, a one-character
word in ordinary mode, whose text consists only of <codeclass="display"><spanclass="extract">]</span></code>. By overwriting
this with a comma, we instead get a one-character word in ordinary mode
whose text consists only of a comma. We then feed a space to end that word;
then feed a double-quote to start text again.
</p>
<pclass="inwebparagraph">But, it might be objected: surely the feeder above is still poised with
the third character of its sequence (space, <codeclass="display"><spanclass="extract">]</span></code>, space), and so
will now feed a spurious space into the start of our resumed text?
Happily, the answer is no: this is why the feeder above checks that it
is still in ordinary mode before sending that third character. Having
opened quotes again, we have put the lexer into literal mode, and so the
spurious space is never fed.
</p>
<pclass="macrodefinition"><codeclass="display">
<<spanclass="cwebmacrodefn">Force string division at the end of a text substitution, if necessary</span><spanclass="cwebmacronumber">26.9</span>> =
<spanclass="plain">*(</span><spanclass="identifier">lexer_hwm</span><spanclass="plain">-1) = </span><spanclass="constant">TEXT_SUBSTITUTION_SEPARATOR</span><spanclass="plain">; </span><spanclass="comment">overwrite recorded copy of <codeclass="display"><spanclass="extract">]</span></code> with <codeclass="display"><spanclass="extract">,</span></code></span>
<spanclass="functiontext">Lexer::feed_char_into_lexer</span><spanclass="plain">(</span><spanclass="character">' '</span><spanclass="plain">); </span><spanclass="comment">then feed a space to end the <codeclass="display"><spanclass="extract">,</span></code> word</span>
<spanclass="functiontext">Lexer::feed_char_into_lexer</span><spanclass="plain">(</span><spanclass="constant">STRING_BEGIN</span><spanclass="plain">); </span><spanclass="comment">then feed <codeclass="display"><spanclass="extract">"</span></code> to open a new string</span>
<pclass="endnote">This code is used in <ahref="#SP26">§26</a>.</p>
<pclass="inwebparagraph"><aid="SP27"></a><b>§27. </b>Finally, note that the breaking-up process may result in empty strings
where square brackets abut each other or the ends of the original string.
Thus
</p>
<blockquote>
<p>"[The noun] is on the [colour][style] table."</p>
</blockquote>
<pclass="inwebparagraph">is split as: <codeclass="display"><spanclass="extract">"" , The noun , " is on the " , colour , "" , style , " table."</span></code>
This is not a bug: empty strings are legal. It's for higher-level code to
remove them if they aren't wanted.
</p>
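<pclass="inwebparagraph">The net effect of the division can be reproduced with a much simpler sketch
than the lexer's own mechanism. The function below is hypothetical: it rewrites a single quoted
literal as flat text rather than feeding characters recursively, and it assumes the input is one
string literal including its quotation marks.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to the C string in dst if it fits within cap bytes
   (terminator included); returns 0 if it would overflow. */
static int append(char *dst, size_t cap, const char *src) {
    size_t used = strlen(dst), extra = strlen(src);
    if (used + extra + 1 > cap) return 0;
    memcpy(dst + used, src, extra + 1);
    return 1;
}

/* Hypothetical illustration only: show how a quoted literal divides at
   its text substitutions. Returns 1 on success, 0 if `out` is too small. */
static int split_at_substitutions(const char *literal, char *out, size_t cap) {
    assert(literal != NULL && out != NULL);
    if (cap == 0) return 0;
    out[0] = 0;
    for (const char *p = literal; *p; p++) {
        char one[2] = { *p, 0 };
        const char *piece = one;
        if (*p == '[') piece = "\" , ";      /* close the string, then a comma word */
        else if (*p == ']') piece = " , \""; /* a comma word, then reopen the string */
        if (!append(out, cap, piece)) return 0;
    }
    return 1;
}
```

<pclass="inwebparagraph">Applied to the two examples above, this yields the same divisions, empty
strings included, since a bracket next to a quotation mark or another bracket simply closes a
string that has only just been opened.
</p>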
<pclass="inwebparagraph"><aid="SP28"></a><b>§28. Splicing. </b>Once in a while, we need to have a run of words in the lexer which
all do occur in the source text, but not contiguously, so that they
cannot be represented by a pair <codeclass="display"><spanclass="extract">(w1, w2)</span></code>. In that event we use the
following routine to splice duplicate references at the end of the word
list (this does not duplicate the text itself, only references to it):
for instance, if we start with 10 words (0 to 9) and then splice <codeclass="display"><spanclass="extract">(2,3)</span></code>
and then <codeclass="display"><spanclass="extract">(6,8)</span></code>, we end up with 15 words, and the text of <codeclass="display"><spanclass="extract">(10,14)</span></code>
contains the same material as words 2, 3, 6, 7, 8.
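<pclass="inwebparagraph">The arithmetic of that example can be checked with a small sketch. The
<codeclass="display"><spanclass="extract">word_list</span></code> type, its fixed capacity and the <codeclass="display"><spanclass="extract">splice</span></code> function
here are assumptions made for illustration, not the module's own data structures.
</p>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORDS 32 /* hypothetical capacity, for this sketch only */

typedef struct {
    const char *text[MAX_WORDS]; /* references to word text, not copies */
    int count;
} word_list;

/* Append duplicate references to words w1..w2 (inclusive) at the end of
   the list. Returns the index of the first new word, or -1 if the range
   is invalid or the list is full. */
static int splice(word_list *wl, int w1, int w2) {
    assert(wl != NULL);
    if (w1 < 0 || w2 < w1 || w2 >= wl->count) return -1;
    int n = w2 - w1 + 1;
    if (wl->count + n > MAX_WORDS) return -1;
    int first = wl->count;
    for (int i = 0; i < n; i++)
        wl->text[first + i] = wl->text[w1 + i]; /* share the text itself */
    wl->count += n;
    return first;
}
```

<pclass="inwebparagraph">Starting from 10 words, splicing <codeclass="display"><spanclass="extract">(2,3)</span></code> returns 10 and
splicing <codeclass="display"><spanclass="extract">(6,8)</span></code> returns 12, leaving 15 words in all, with words 10 to 14 pointing
at the same text as words 2, 3, 6, 7 and 8.
</p>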
<pclass="inwebparagraph"><aid="SP29"></a><b>§29. Basic command-line error handler. </b>Some tools using this module will want to push simple error messages out to
the command line; others will want to translate them into elaborate problem
texts in HTML. So the client is allowed to define <codeclass="display"><spanclass="extract">LEXER_PROBLEM_HANDLER</span></code>
<spanclass="identifier">Errors::fatal</span><spanclass="plain">(</span><spanclass="string">"Out of memory: unable to create lexer workspace"</span><spanclass="plain">);</span>
<spanclass="identifier">Errors::with_text</span><spanclass="plain">(</span><spanclass="string">"Too much text in quotation marks: %S"</span><spanclass="plain">, </span><spanclass="identifier">word_t</span><spanclass="plain">);</span>
<spanclass="identifier">Errors::with_text</span><spanclass="plain">(</span><spanclass="string">"Word too long: %S"</span><spanclass="plain">, </span><spanclass="identifier">word_t</span><spanclass="plain">);</span>
<spanclass="identifier">Errors::with_text</span><spanclass="plain">(</span><spanclass="string">"Quoted text never ends: %S"</span><spanclass="plain">, </span><spanclass="identifier">problem_source_description</span><spanclass="plain">);</span>
<spanclass="identifier">Errors::with_text</span><spanclass="plain">(</span><spanclass="string">"Square-bracketed text never ends: %S"</span><spanclass="plain">, </span><spanclass="identifier">problem_source_description</span><spanclass="plain">);</span>
<spanclass="identifier">Errors::with_text</span><spanclass="plain">(</span><spanclass="string">"I6 inclusion text never ends: %S"</span><spanclass="plain">, </span><spanclass="identifier">problem_source_description</span><spanclass="plain">);</span>
<pclass="endnote">The function Lexer::lexer_problem_handler is used in <ahref="#SP12">§12</a>, <ahref="#SP24_3">§24.3</a>, <ahref="#SP26_5_1">§26.5.1</a>.</p>