mirror of
https://github.com/ganelson/inform.git
synced 2024-07-16 22:14:23 +03:00
1554 lines
146 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head>
|
|
<title>2/wa</title>
|
|
<meta name="viewport" content="width=device-width, initial-scale=1">
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
|
|
<meta http-equiv="Content-Language" content="en-gb">
|
|
<link href="../inweb.css" rel="stylesheet" rev="stylesheet" type="text/css">
|
|
</head>
|
|
<body>
|
|
<nav role="navigation">
|
|
<h1><a href="../webs.html">Sources</a></h1>
|
|
<ul>
|
|
<li><a href="../compiler.html"><b>compiler</b></a></li>
|
|
<li><a href="../other.html">other tools</a></li>
|
|
<li><a href="../extensions.html">extensions and kits</a></li>
|
|
<li><a href="../units.html">unit test tools</a></li>
|
|
</ul>
|
|
<h2>Compiler Webs</h2>
|
|
<ul>
|
|
<li><a href="../inbuild/index.html">inbuild</a></li>
|
|
<li><a href="../inform7/index.html">inform7</a></li>
|
|
<li><a href="../inter/index.html">inter</a></li>
|
|
</ul>
|
|
<h2>Inbuild Modules</h2>
|
|
<ul>
|
|
<li><a href="../inbuild-module/index.html">inbuild</a></li>
|
|
<li><a href="../arch-module/index.html">arch</a></li>
|
|
<li><a href="../words-module/index.html">words</a></li>
|
|
<li><a href="../syntax-module/index.html">syntax</a></li>
|
|
<li><a href="../html-module/index.html">html</a></li>
|
|
</ul>
|
|
<h2>Inform7 Modules</h2>
|
|
<ul>
|
|
<li><a href="../core-module/index.html">core</a></li>
|
|
<li><a href="../problems-module/index.html">problems</a></li>
|
|
<li><a href="../inflections-module/index.html">inflections</a></li>
|
|
<li><a href="../linguistics-module/index.html">linguistics</a></li>
|
|
<li><a href="../kinds-module/index.html">kinds</a></li>
|
|
<li><a href="../if-module/index.html">if</a></li>
|
|
<li><a href="../multimedia-module/index.html">multimedia</a></li>
|
|
<li><a href="../index-module/index.html">index</a></li>
|
|
</ul>
|
|
<h2>Inter Modules</h2>
|
|
<ul>
|
|
<li><a href="../inter-module/index.html">inter</a></li>
|
|
<li><a href="../building-module/index.html">building</a></li>
|
|
<li><a href="../codegen-module/index.html">codegen</a></li>
|
|
</ul>
|
|
<h2>Foundation</h2>
|
|
<ul>
|
|
<li><a href="../../../inweb/docs/foundation-module/index.html">foundation</a></li>
|
|
</ul>
|
|
|
|
|
|
</nav>
|
|
<main role="main">
|
|
|
|
<!--Weave of '3/lxr' generated by 7-->
|
|
<ul class="crumbs"><li><a href="../webs.html">Source</a></li><li><a href="../compiler.html">Compiler Modules</a></li><li><a href="index.html">words</a></li><li><a href="index.html#3">Chapter 3: Words in Sequence</a></li><li><b>Lexer</b></li></ul><p class="purpose">To break down a stream of characters into a numbered sequence of words, literal strings and literal I6 inclusions, removing comments and unnecessary whitespace.</p>
|
|
|
|
<ul class="toc"><li><a href="#SP1">§1. Definitions</a></li><li><a href="#SP5">§5. The lexical structure of source text</a></li><li><a href="#SP9">§9. What the lexer stores for each word</a></li><li><a href="#SP15">§15. External lexer states</a></li><li><a href="#SP16">§16. Definition of punctuation</a></li><li><a href="#SP17">§17. Definition of indentation</a></li><li><a href="#SP18">§18. Access functions</a></li><li><a href="#SP19">§19. Definition of white space</a></li><li><a href="#SP20">§20. Internal lexer states</a></li><li><a href="#SP24">§24. Feeding the lexer</a></li><li><a href="#SP26">§26. Lexing one character at a time</a></li><li><a href="#SP26_1">§26.1. Dealing with whitespace</a></li><li><a href="#SP26_5">§26.5. Completing a word</a></li><li><a href="#SP26_6">§26.6. Entering and leaving literal mode</a></li><li><a href="#SP26_8">§26.8. Breaking strings up at text substitutions</a></li><li><a href="#SP28">§28. Splicing</a></li></ul><hr class="tocbar">
|
|
|
|
<p class="inwebparagraph"><a id="SP1"></a><b>§1. Definitions. </b></p>
|
|
|
|
<p class="inwebparagraph"><a id="SP2"></a><b>§2. </b>Lexical analysis is the process of reading characters from the source
|
|
text files and forming them into globs which we call "words": the part of
|
|
Inform which does this is the "lexical analyser", or lexer for short. The
|
|
algorithms in this chapter are entirely routine, but occasional eye-opening
|
|
moments come because natural language does not have the rigorous division
|
|
between lexical and semantic parsing which programming language theory
|
|
expects. For instance, we want NI to be case insensitive for the most part,
|
|
but we cannot discard upper case entirely at the lexical stage because we
|
|
will need it later to decide whether punctuation at the end of a quotation
|
|
is meant to end the sentence making the quote, or not. Humans certainly
|
|
read these differently:
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>Say "Hello!" with alarm, ...<br>Say "Hello!" With alarm, ...</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">Paragraph breaks can also carry semantic meaning: an ordinary gap between
two words does not end a sentence, but a paragraph break between them clearly does.
So semantic considerations occasionally creep into even the
earliest parts of this chapter.
|
|
</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP3"></a><b>§3. </b>We must never lose sight of the origin of text, because we may need to
|
|
print problem messages back to the user which refer to that original material.
|
|
We record the provenance of text using the following structure; the
|
|
<code class="display"><span class="extract">lexer_position</span></code> is such a structure, and marks where the lexer is
|
|
currently reading.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">typedef</span><span class="plain"> </span><span class="reserved">struct</span><span class="plain"> </span><span class="reserved">source_location</span><span class="plain"> {</span>
|
|
<span class="reserved">struct</span><span class="plain"> </span><span class="reserved">source_file</span><span class="plain"> *</span><span class="identifier">file_of_origin</span><span class="plain">; </span> <span class="comment">or <code class="display"><span class="extract">NULL</span></code> if internally written and not from a file</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">line_number</span><span class="plain">; </span> <span class="comment">counting upwards from 1 within file (if any)</span>
|
|
<span class="plain">} </span><span class="reserved">source_location</span><span class="plain">;</span>
|
|
|
|
<span class="reserved">source_location</span><span class="plain"> </span><span class="identifier">lexer_position</span><span class="plain">;</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The structure source_location is accessed in 3/tff, 3/fds and here.</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP4"></a><b>§4. </b>A word can be an English word such as <code class="display"><span class="extract">bedspread</span></code>, or a piece of punctuation
|
|
such as <code class="display"><span class="extract">!</span></code>, or a number such as <code class="display"><span class="extract">127</span></code>, or a piece of quoted text of arbitrary
|
|
size such as <code class="display"><span class="extract">"I summon up remembrance of things past"</span></code>.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">The words found are numbered 0, 1, 2, ... in order of being read by
|
|
the lexer. The first eight or so words come from the mandatory insertion
|
|
text (see Read Source Text.w), then come the words from the primary source
|
|
text, then those from the extensions loaded.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">References to text throughout NI's data structures are often in the form
|
|
of a pair of word numbers, usually called <code class="display"><span class="extract">w1</span></code> and <code class="display"><span class="extract">w2</span></code> or some variation
|
|
on that, indicating the text which starts at word <code class="display"><span class="extract">w1</span></code> and finishes
|
|
at <code class="display"><span class="extract">w2</span></code> (including both ends). Thus if the text is
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>When to the sessions of sweet silent thought</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">then the eight words are numbered 0 to 7 and a reference to <code class="display"><span class="extract">w1=2</span></code>, <code class="display"><span class="extract">w2=5</span></code>
|
|
would mean the sub-text "the sessions of sweet". The special null value
|
|
<code class="display"><span class="extract">wn=-1</span></code> is used when no word reference has been made: never 0, as that would
|
|
mean the first word in the list. The maximum legal word number is always one
|
|
less than the following variable's value.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_wordcount</span><span class="plain">; </span> <span class="comment">Number of words read in to arrays</span>
|
|
</pre>
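<p class="inwebparagraph">To make the inclusive convention concrete, here is a minimal sketch which is not taken from Inform itself. The names <code class="display"><span class="extract">NO_WORD</span></code>, <code class="display"><span class="extract">range_length</span></code> and <code class="display"><span class="extract">join_range</span></code> are hypothetical, and for simplicity the words are plain C strings rather than entries in the lexer's own arrays.</p>

```c
#include <stddef.h>
#include <string.h>

#define NO_WORD (-1) /* hypothetical name for the null word number, -1 */

/* Number of words in the inclusive range w1..w2, or 0 if no range is set. */
int range_length(int w1, int w2) {
    if (w1 == NO_WORD || w2 == NO_WORD || w2 < w1) return 0;
    return w2 - w1 + 1;
}

/* Join words w1..w2 (both ends included) into buf, separated by single
   spaces. Returns the number of words actually written. */
int join_range(const char *words[], int wordcount, int w1, int w2,
               char *buf, size_t buflen) {
    if (buflen == 0) return 0;
    buf[0] = '\0';
    if (range_length(w1, w2) == 0 || w1 < 0 || w2 >= wordcount) return 0;
    size_t used = 0;
    int written = 0;
    for (int i = w1; i <= w2; i++) {
        size_t need = strlen(words[i]) + (i > w1 ? 1 : 0);
        if (used + need >= buflen) break; /* stop rather than overflow */
        if (i > w1) buf[used++] = ' ';
        strcpy(buf + used, words[i]);
        used += strlen(words[i]);
        written++;
    }
    return written;
}
```

<p class="inwebparagraph">With the eight words of the line quoted above, the range 2 to 5 contains four words and joins to "the sessions of sweet"; the null value produces an empty range, never the first word.</p>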
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="inwebparagraph"><a id="SP5"></a><b>§5. The lexical structure of source text. </b>The following definitions are fairly self-evident: they specify which
|
|
characters cause word divisions, or signal literals.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="constant">STRING_BEGIN</span><span class="plain"> </span><span class="character">'"'</span><span class="plain"> </span> <span class="comment">Strings are always double-quoted</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">STRING_END</span><span class="plain"> </span><span class="character">'"'</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">TEXT_SUBSTITUTION_BEGIN</span><span class="plain"> </span><span class="character">'['</span><span class="plain"> </span> <span class="comment">Inside strings, this denotes a text substitution</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">TEXT_SUBSTITUTION_END</span><span class="plain"> </span><span class="character">']'</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">TEXT_SUBSTITUTION_SEPARATOR</span><span class="plain"> </span><span class="character">','</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">COMMENT_BEGIN</span><span class="plain"> </span><span class="character">'['</span><span class="plain"> </span> <span class="comment">Text between these, outside strings, is comment</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">COMMENT_END</span><span class="plain"> </span><span class="character">']'</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">INFORM6_ESCAPE_BEGIN_1</span><span class="plain"> </span><span class="character">'('</span><span class="plain"> </span> <span class="comment">Text beginning with this pair is literal I6 code</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">INFORM6_ESCAPE_BEGIN_2</span><span class="plain"> </span><span class="character">'-'</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">INFORM6_ESCAPE_END_1</span><span class="plain"> </span><span class="character">'-'</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">INFORM6_ESCAPE_END_2</span><span class="plain"> </span><span class="character">')'</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">PARAGRAPH_BREAK</span><span class="plain"> </span><span class="identifier">L</span><span class="string">"|__"</span><span class="plain"> </span> <span class="comment">Inserted as a special word to mark paragraph breaks</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">UNICODE_CHAR_IN_STRING</span><span class="plain"> ((</span><span class="identifier">wchar_t</span><span class="plain">) 0x1b) </span> <span class="comment">To represent awkward characters in metadata only</span>
|
|
</pre>
|
|
<p class="inwebparagraph"><a id="SP6"></a><b>§6. </b>This is the standard set used for parsing source text.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="constant">STANDARD_PUNCTUATION_MARKS</span><span class="plain"> </span><span class="identifier">L</span><span class="string">".,:;?!(){}[]"</span><span class="plain"> </span> <span class="comment">Do not add to this list lightly!</span>
|
|
</pre>
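<p class="inwebparagraph">A membership test against such a set might look like the following. This is only a sketch: the function name <code class="display"><span class="extract">is_punctuation</span></code> is an assumption made for illustration, and the lexer's real test appears later in this section. The definition is repeated so that the fragment stands alone.</p>

```c
#include <wchar.h>

#define STANDARD_PUNCTUATION_MARKS L".,:;?!(){}[]"

/* Returns 1 if c is one of the characters in marks, and 0 otherwise.
   (Hypothetical helper, not the lexer's own function.) */
int is_punctuation(const wchar_t *marks, wchar_t c) {
    if (c == 0) return 0; /* wcschr would otherwise match the terminator */
    return (wcschr(marks, c) != NULL) ? 1 : 0;
}
```

<p class="inwebparagraph">Passing the set as an argument, rather than hard-wiring it, leaves room for a caller to lex with a different set of marks.</p>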
|
|
<p class="inwebparagraph"><a id="SP7"></a><b>§7. </b>This seems a good point to describe how best to syntax-colour source
|
|
text, something which the user interfaces do on every platform. By
|
|
convention we are sparing with the colours: ordinary word-processing
|
|
is not a kaleidoscopic experience (even when Microsoft Word's impertinent
|
|
grammar checker is accidentally left switched on), and we want the experience
|
|
of writing Inform source text to be like writing, not like programming.
|
|
So we use just a little colour, and that goes a long way.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">Because the Inform applications generally syntax-colour source text in the
|
|
Source panel of the user interface, it is probably worth writing down the
|
|
lexical specification. There are eight basic categories of text, and
|
|
they should be detected in the following order, with the first category
|
|
that applies being the one to determine the colour and/or font weight:
|
|
</p>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<ul class="items"><li>(1) Titling text (primary source text only: not found in extensions).
|
|
If the first non-whitespace in the file is a double-quoted text (see (4a)),
|
|
this is the title of the work.
|
|
</li></ul>
|
|
<ul class="items"><li>(2) Documentation text (extension text only: not found in primary source).
If a paragraph consists of a single non-whitespace token only, and that
token is <code class="display"><span class="extract">----</span></code> (four hyphens in a row), then this paragraph and all
subsequent text down to the bottom of the file is documentation text.
|
|
</li></ul>
|
|
<ul class="items"><li>(3) Heading text. If a paragraph consists of a single line only, and that line
|
|
begins with one of the five words Volume, Book, Part, Chapter or Section,
|
|
capitalised as here, then that paragraph is a heading. (A paragraph
|
|
division is found at the start and end of a file, and also at any run
|
|
of white space containing two or more newline characters: a newline
|
|
can be any of the Unicode characters <code class="display"><span class="extract">0x000A</span></code>, <code class="display"><span class="extract">0x2028</span></code> or <code class="display"><span class="extract">0x2029</span></code>.)
|
|
</li></ul>
|
|
<ul class="items"><li>(4a) Quoted text. Outside of (4b) and (4c), a double-quotation mark
|
|
(in principle any of Unicode <code class="display"><span class="extract">0x0022</span></code>, <code class="display"><span class="extract">0x201C</span></code>, <code class="display"><span class="extract">0x201D</span></code>) begins
|
|
quoted text provided it follows either whitespace, or the start of
|
|
the file, or one of the punctuation marks in the <code class="display"><span class="extract">STANDARD_PUNCTUATION_MARKS</span></code>
|
|
string defined above. Quoted text continues until the next
|
|
double-quotation mark (or the end of the file if there isn't one,
|
|
though NI would issue Problems if asked to compile this).
|
|
</li></ul>
|
|
<ul class="items"><li>(4a1) Text substitution text. Within (4a) only, an open square bracket
|
|
introduces text substitution matter which continues until the next
|
|
close square bracket or the end of the quoted text. (Again, NI would
|
|
issue problem messages if given a string malformed in this way.)
|
|
</li></ul>
|
|
<ul class="items"><li>(4b) Comment text. Outside of (4a) and (4c), an open square bracket begins
|
|
comment. Comment continues until the next matching close square
|
|
bracket. (This is the case even if that is in double quotes within the
|
|
comment, i.e., quotation marks should be ignored when matching <code class="display"><span class="extract">[</span></code> and <code class="display"><span class="extract">]</span></code>
|
|
inside a comment.) Thus, nested comments are allowed, and the following
|
|
text contains a single comment running from the first open bracket, just after
"the", to the final close bracket, just before the full stop:
|
|
</li></ul>
|
|
<blockquote>
|
|
<p><code class="display"><span class="extract">Snow White and the [Seven Dwarfs [but not Doc]].</span></code></p>
|
|
|
|
</blockquote>
|
|
|
|
<ul class="items"><li>(4c) Literal I6 code. Outside of (4a) and (4b), the combination <code class="display"><span class="extract">(-</span></code> begins
|
|
literal I6 matter. This matter continues until the next <code class="display"><span class="extract">-)</span></code> is reached.
|
|
Within literal I6 matter, one can escape back into I7 source text using a
|
|
matched pair of <code class="display"><span class="extract">(+</span></code> and <code class="display"><span class="extract">+)</span></code> tokens, but it really doesn't seem worth
|
|
syntax colouring this very much. And the authors of Inform will lose no
|
|
sleep if we miscolour this, for instance, especially if it deters people
|
|
from such horrible coding practices:
|
|
</li></ul>
|
|
<blockquote>
|
|
<p><code class="display"><span class="extract">(- Constant BLOB = (+ the total weight of things in (- selfobj -) +); -)</span></code></p>
|
|
|
|
</blockquote>
|
|
|
|
<ul class="items"><li>(5) Normal text. Everything else.
|
|
</li></ul>
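<p class="inwebparagraph">The bracket-matching rule for comments in (4b) can be sketched as a simple depth count. This is an illustration written for a syntax-colourer, not code from Inform: the function name <code class="display"><span class="extract">find_comment_end</span></code> is hypothetical, and it works on plain C strings.</p>

```c
#include <stddef.h>

/* Given that text[start] is the '[' opening a comment, return the index of
   the matching ']', or -1 if the comment is never closed. Quotation marks
   are deliberately ignored, so only brackets affect the nesting depth. */
long find_comment_end(const char *text, size_t start) {
    if (text[start] != '[') return -1;
    int depth = 0;
    for (size_t i = start; text[i] != '\0'; i++) {
        if (text[i] == '[') {
            depth++;
        } else if (text[i] == ']') {
            depth--;
            if (depth == 0) return (long) i;
        }
    }
    return -1; /* ran off the end of the text while still inside a comment */
}
```

<p class="inwebparagraph">Applied to the Snow White example, the match found for the first open bracket is the last close bracket, so the full stop lies outside the comment.</p>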
|
|
<p class="inwebparagraph">NI regards all of the Unicode characters <code class="display"><span class="extract">0x0009</span></code>, <code class="display"><span class="extract">0x000A</span></code>, <code class="display"><span class="extract">0x000D</span></code>,
|
|
<code class="display"><span class="extract">0x0020</span></code>, <code class="display"><span class="extract">0x0085</span></code>, <code class="display"><span class="extract">0x00A0</span></code>, <code class="display"><span class="extract">0x2000</span></code> to <code class="display"><span class="extract">0x200A</span></code>, <code class="display"><span class="extract">0x2028</span></code> and <code class="display"><span class="extract">0x2029</span></code>
|
|
as instances of white space. Of course, it's entirely open to the Inform
|
|
user interfaces to not allow the user to key some of these codes, but
|
|
we should bear in mind that projects using them might be created on one
|
|
platform and then reopened on another one, so it's probably best to be
|
|
careful.
|
|
</p>
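<p class="inwebparagraph">For a user interface wanting to apply the same rules, the white space and paragraph-division tests described above might be sketched like this. The function names are assumptions made for illustration; the lexer's own definition of white space comes later in this section.</p>

```c
#include <wchar.h>

/* 1 if c is one of the code points listed above as white space. */
int is_white_space(wchar_t c) {
    switch (c) {
        case 0x0009: case 0x000A: case 0x000D: case 0x0020:
        case 0x0085: case 0x00A0: case 0x2028: case 0x2029:
            return 1;
    }
    return (c >= 0x2000 && c <= 0x200A) ? 1 : 0;
}

/* 1 if c is one of the three newline characters named in category (3). */
int is_newline(wchar_t c) {
    return (c == 0x000A || c == 0x2028 || c == 0x2029) ? 1 : 0;
}

/* 1 if the run of white space at the start of run contains two or more
   newlines, and therefore counts as a paragraph division. */
int run_is_paragraph_break(const wchar_t *run) {
    int newlines = 0;
    for (; *run != 0 && is_white_space(*run); run++)
        if (is_newline(*run)) newlines++;
    return (newlines >= 2) ? 1 : 0;
}
```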
|
|
|
|
<p class="inwebparagraph"><a id="SP8"></a><b>§8. </b>These categories of text are conventionally displayed as follows:
|
|
</p>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<ul class="items"><li>(1) Titling text: black boldface.
|
|
</li></ul>
|
|
<ul class="items"><li>(2) Documentation text: grey type.
|
|
</li></ul>
|
|
<ul class="items"><li>(3) Heading text: black boldface, perhaps of a slightly larger point
|
|
size.
|
|
</li></ul>
|
|
<ul class="items"><li>(4a) Quoted text: dark blue boldface.
|
|
</li></ul>
|
|
<ul class="items"><li>(4a1) Text substitution text: lighter blue and not boldface.
|
|
</li></ul>
|
|
<ul class="items"><li>(4b) Comment text: darkish green type, perhaps of a slightly smaller point
|
|
size.
|
|
</li></ul>
|
|
<ul class="items"><li>(4c) Literal I6 code: grey type. (Inform for OS X rather coolly goes into
|
|
I6 syntax-colouring, which is considerably harder, for this material:
|
|
see "The Inform 6 Technical Manual" for an algorithm.)
|
|
</li></ul>
|
|
<ul class="items"><li>(5) Normal text: black type.
|
|
</li></ul>
|
|
<p class="inwebparagraph"><a id="SP9"></a><b>§9. What the lexer stores for each word. </b>The lexer builds a small data structure for each individual word it reads.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">typedef</span><span class="plain"> </span><span class="reserved">struct</span><span class="plain"> </span><span class="reserved">lexer_details</span><span class="plain"> {</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lw_text</span><span class="plain">; </span> <span class="comment">text of word after treatment to normalise</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lw_rawtext</span><span class="plain">; </span> <span class="comment">original untouched text of word</span>
|
|
<span class="reserved">struct</span><span class="plain"> </span><span class="reserved">source_location</span><span class="plain"> </span><span class="identifier">lw_source</span><span class="plain">; </span> <span class="comment">where it was read from</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lw_break</span><span class="plain">; </span> <span class="comment">the divider (space, tab, etc.) preceding it</span>
|
|
<span class="reserved">struct</span><span class="plain"> </span><span class="reserved">vocabulary_entry</span><span class="plain"> *</span><span class="identifier">lw_identity</span><span class="plain">; </span> <span class="comment">which distinct word</span>
|
|
<span class="plain">} </span><span class="reserved">lexer_details</span><span class="plain">;</span>
|
|
|
|
<span class="reserved">lexer_details</span><span class="plain"> *</span><span class="identifier">lw_array</span><span class="plain"> = </span><span class="identifier">NULL</span><span class="plain">; </span> <span class="comment">a dynamically allocated (and mobile) array</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_details_memory_allocated</span><span class="plain"> = 0; </span> <span class="comment">bytes allocated to this array</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_workspace_allocated</span><span class="plain"> = 0; </span> <span class="comment">bytes allocated to text storage</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The structure lexer_details is private to this section.</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP10"></a><b>§10. </b>The following bounds on how much we can read cannot be changed without
|
|
editing and recompiling Inform.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">Some readers will be wondering about Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogochuchaf
|
|
(the upper old part of the village of Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch,
|
|
on the Welsh isle of Anglesey), but this has a mere 63 letters, and in any case
|
|
the name was "improved" by the village cobbler in the mid-19th century to
|
|
make it a tourist attraction for the new railway age.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="constant">TEXT_STORAGE_CHUNK_SIZE</span><span class="plain"> 600000 </span> <span class="comment">Must exceed <code class="display"><span class="extract">MAX_VERBATIM_LENGTH+MAX_WORD_LENGTH</span></code></span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">MAX_VERBATIM_LENGTH</span><span class="plain"> 200000 </span> <span class="comment">Largest quantity of Inform 6 which can be quoted verbatim.</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">MAX_WORD_LENGTH</span><span class="plain"> 128 </span> <span class="comment">Maximum length of any unquoted word</span>
|
|
</pre>
|
|
<p class="inwebparagraph"><a id="SP11"></a><b>§11. </b>The main text area of memory has a simple structure: it is allocated in
|
|
one contiguous block, and at any given time the memory is used from the
|
|
lowest address up to (but not including) the "high water mark", a pointer
|
|
in effect to the first free character.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lexer_workspace</span><span class="plain">; </span> <span class="comment">Large area of contiguous memory for text</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lexer_word</span><span class="plain">; </span> <span class="comment">Start of current word in workspace</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lexer_hwm</span><span class="plain">; </span> <span class="comment">High water mark of workspace</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lexer_workspace_end</span><span class="plain">; </span> <span class="comment">Pointer to just past the end of the workspace: HWM must not exceed this</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::start</span><span class="plain">(</span><span class="reserved">void</span><span class="plain">) {</span>
|
|
<span class="identifier">lexer_wordcount</span><span class="plain"> = 0;</span>
|
|
<span class="functiontext">Lexer::ensure_space_up_to</span><span class="plain">(50000); </span> <span class="comment">the Standard Rules are about 44,000 words</span>
|
|
<span class="functiontext">Lexer::allocate_lexer_workspace_chunk</span><span class="plain">(1);</span>
|
|
<span class="functiontext">Vocabulary::start_hash_table</span><span class="plain">();</span>
|
|
<span class="plain">}</span>
|
|
</pre>
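<p class="inwebparagraph">The high water mark scheme is easy to picture with a much smaller workspace. The following is a self-contained sketch under stated assumptions: the <code class="display"><span class="extract">workspace</span></code> structure and both function names are hypothetical, it uses the standard <code class="display"><span class="extract">calloc</span></code> rather than Inform's memory manager, and it refuses a word which will not fit instead of allocating a further chunk.</p>

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

typedef struct workspace {
    wchar_t *start; /* lowest address of the contiguous block */
    wchar_t *hwm;   /* high water mark: the first free character */
    wchar_t *end;   /* just past the end: hwm must not exceed this */
} workspace;

/* Allocate room for capacity characters. Returns 1 on success, 0 on failure. */
int workspace_init(workspace *ws, size_t capacity) {
    ws->start = calloc(capacity, sizeof(wchar_t));
    if (ws->start == NULL) return 0;
    ws->hwm = ws->start;
    ws->end = ws->start + capacity;
    return 1;
}

/* Copy word, with its null terminator, to the high water mark and advance
   the mark past it. Returns the stored copy, or NULL if there is no room. */
wchar_t *workspace_store(workspace *ws, const wchar_t *word) {
    size_t need = wcslen(word) + 1;
    if ((size_t) (ws->end - ws->hwm) < need) return NULL;
    wchar_t *stored = ws->hwm;
    wmemcpy(stored, word, need);
    ws->hwm += need;
    return stored;
}
```

<p class="inwebparagraph">Nothing is ever freed individually in this arrangement: words simply accumulate upwards from the start of the block.</p>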
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::start is used in 1/wm (<a href="1-wm.html#SP3">§3</a>).</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP12"></a><b>§12. </b>These are quite hefty memory allocations, and the most expensive one,
<code class="display"><span class="extract">lw_source</span></code>, is also the least essential to NI's running. But at least
we use memory in a way vaguely related to the size of the source
text, never using more than twice what we need, and we impose no absolute
upper limits.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">current_lw_array_size</span><span class="plain"> = 0, </span><span class="identifier">next_lw_array_size</span><span class="plain"> = 75000;</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::ensure_space_up_to</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">n</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">n</span><span class="plain"> &lt; </span><span class="identifier">current_lw_array_size</span><span class="plain">) </span><span class="reserved">return</span><span class="plain">;</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">new_size</span><span class="plain"> = </span><span class="identifier">current_lw_array_size</span><span class="plain">;</span>
|
|
<span class="reserved">while</span><span class="plain"> (</span><span class="identifier">n</span><span class="plain"> >= </span><span class="identifier">new_size</span><span class="plain">) {</span>
|
|
<span class="identifier">new_size</span><span class="plain"> = </span><span class="identifier">next_lw_array_size</span><span class="plain">;</span>
|
|
<span class="identifier">next_lw_array_size</span><span class="plain"> = </span><span class="identifier">next_lw_array_size</span><span class="plain">*2;</span>
|
|
<span class="plain">}</span>
|
|
<span class="identifier">lexer_details_memory_allocated</span><span class="plain"> = </span><span class="identifier">new_size</span><span class="plain">*((</span><span class="reserved">int</span><span class="plain">) </span><span class="reserved">sizeof</span><span class="plain">(</span><span class="reserved">lexer_details</span><span class="plain">));</span>
|
|
<span class="reserved">lexer_details</span><span class="plain"> *</span><span class="identifier">new_lw_array</span><span class="plain"> =</span>
|
|
<span class="plain">((</span><span class="reserved">lexer_details</span><span class="plain"> *) (</span><span class="identifier">Memory::I7_calloc</span><span class="plain">(</span><span class="identifier">new_size</span><span class="plain">, </span><span class="reserved">sizeof</span><span class="plain">(</span><span class="reserved">lexer_details</span><span class="plain">), </span><span class="constant">LEXER_WORDS_MREASON</span><span class="plain">)));</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">new_lw_array</span><span class="plain"> == </span><span class="identifier">NULL</span><span class="plain">) {</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">MEMORY_OUT_LEXERERROR</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">);</span>
|
|
<span class="identifier">exit</span><span class="plain">(1); </span> <span class="comment">in case the handler fails to do this</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">for</span><span class="plain"> (</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">i</span><span class="plain">=0; </span><span class="identifier">i</span><span class="plain">&lt;</span><span class="identifier">new_size</span><span class="plain">; </span><span class="identifier">i</span><span class="plain">++) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">i</span><span class="plain"> &lt; </span><span class="identifier">current_lw_array_size</span><span class="plain">) </span><span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">] = </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">];</span>
|
|
<span class="reserved">else</span><span class="plain"> {</span>
|
|
<span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain"> = </span><span class="identifier">NULL</span><span class="plain">;</span>
|
|
<span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain"> = </span><span class="identifier">NULL</span><span class="plain">;</span>
|
|
<span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]</span><span class="element">.lw_break</span><span class="plain"> = </span><span class="character">' '</span><span class="plain">;</span>
|
|
<span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]</span><span class="element">.lw_source.file_of_origin</span><span class="plain"> = </span><span class="identifier">NULL</span><span class="plain">;</span>
|
|
<span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]</span><span class="element">.lw_source.line_number</span><span class="plain"> = -1;</span>
|
|
<span class="identifier">new_lw_array</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]</span><span class="element">.lw_identity</span><span class="plain"> = </span><span class="identifier">NULL</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lw_array</span><span class="plain">) </span><span class="identifier">Memory::I7_array_free</span><span class="plain">(</span><span class="identifier">lw_array</span><span class="plain">, </span><span class="constant">LEXER_WORDS_MREASON</span><span class="plain">,</span>
|
|
<span class="identifier">current_lw_array_size</span><span class="plain">, ((</span><span class="reserved">int</span><span class="plain">) </span><span class="reserved">sizeof</span><span class="plain">(</span><span class="reserved">lexer_details</span><span class="plain">)));</span>
|
|
<span class="identifier">lw_array</span><span class="plain"> = </span><span class="identifier">new_lw_array</span><span class="plain">;</span>
|
|
<span class="identifier">current_lw_array_size</span><span class="plain"> = </span><span class="identifier">new_size</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::ensure_space_up_to is used in <a href="#SP11">§11</a>, <a href="#SP26_5_2">§26.5.2</a>, <a href="#SP28">§28</a>.</p>
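<p class="inwebparagraph">The growth step above can be tried out in isolation. What follows is only a minimal sketch, not the lexer's own code: the <code class="display"><span class="extract">word_details</span></code> structure, the starting size of 16 and the doubling policy are all assumptions made for illustration. What it shares with the real routine is the loop which copies the old entries across and gives every new slot a blank default.</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for the lexer's per-word record. */
typedef struct word_details {
    const char *text;   /* NULL until the slot is used */
    int break_char;     /* whitespace character preceding the word */
    int line_number;    /* -1 means "no known source line" */
} word_details;

static word_details *words = NULL;
static int words_size = 0;

/* Make sure slot n exists. Doubling is an assumed growth policy. */
void ensure_space_up_to(int n) {
    if (n < words_size) return;
    int new_size = (words_size > 0) ? words_size : 16;
    while (n >= new_size) new_size *= 2;

    word_details *bigger = calloc((size_t) new_size, sizeof(word_details));
    assert(bigger != NULL);
    for (int i = 0; i < new_size; i++) {
        if (i < words_size) bigger[i] = words[i];   /* keep existing words */
        else {                                      /* blank out new slots */
            bigger[i].text = NULL;
            bigger[i].break_char = ' ';
            bigger[i].line_number = -1;
        }
    }
    free(words);   /* free(NULL) is harmless on the first call */
    words = bigger;
    words_size = new_size;
}
```

<p class="inwebparagraph">Because the array may move when it grows, callers in this sketch should hold word numbers rather than pointers to individual entries.</p>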
|
|
|
|
<p class="inwebparagraph"><a id="SP13"></a><b>§13. </b>Inform would almost certainly crash if we wrote past the end of the
|
|
workspace, so we need to watch for the water running high. The following
|
|
routine checks that there is room for another <code class="display"><span class="extract">n</span></code> characters, plus a
|
|
termination character, plus breathing space for a single character's worth
|
|
of lookahead:
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::ensure_lexer_hwm_can_be_raised_by</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">n</span><span class="plain">, </span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">transfer_partial_word</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_hwm</span><span class="plain"> + </span><span class="identifier">n</span><span class="plain"> + 2 >= </span><span class="identifier">lexer_workspace_end</span><span class="plain">) {</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">old_hwm</span><span class="plain"> = </span><span class="identifier">lexer_hwm</span><span class="plain">;</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">m</span><span class="plain"> = 1;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">transfer_partial_word</span><span class="plain">) {</span>
|
|
<span class="identifier">m</span><span class="plain"> = (((</span><span class="reserved">int</span><span class="plain">) (</span><span class="identifier">old_hwm</span><span class="plain"> - </span><span class="identifier">lexer_word</span><span class="plain">) + </span><span class="identifier">n</span><span class="plain"> + 3)/</span><span class="constant">TEXT_STORAGE_CHUNK_SIZE</span><span class="plain">) + 1;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">m</span><span class="plain"> &lt; 1) </span><span class="identifier">m</span><span class="plain"> = 1;</span>
|
|
<span class="plain">}</span>
|
|
<span class="functiontext">Lexer::allocate_lexer_workspace_chunk</span><span class="plain">(</span><span class="identifier">m</span><span class="plain">);</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">transfer_partial_word</span><span class="plain">) {</span>
|
|
<span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">++) = </span><span class="character">' '</span><span class="plain">;</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">new_lword</span><span class="plain"> = </span><span class="identifier">lexer_hwm</span><span class="plain">;</span>
|
|
<span class="reserved">while</span><span class="plain"> (</span><span class="identifier">lexer_word</span><span class="plain"> &lt; </span><span class="identifier">old_hwm</span><span class="plain">) {</span>
|
|
<span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">++) = *(</span><span class="identifier">lexer_word</span><span class="plain">++);</span>
|
|
<span class="plain">}</span>
|
|
<span class="identifier">lexer_word</span><span class="plain"> = </span><span class="identifier">new_lword</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_hwm</span><span class="plain"> + </span><span class="identifier">n</span><span class="plain"> + 2 >= </span><span class="identifier">lexer_workspace_end</span><span class="plain">)</span>
|
|
<span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"further allocation failed to liberate enough space"</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::allocate_lexer_workspace_chunk</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">multiplier</span><span class="plain">) {</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">extent</span><span class="plain"> = </span><span class="identifier">multiplier</span><span class="plain"> * </span><span class="constant">TEXT_STORAGE_CHUNK_SIZE</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_workspace</span><span class="plain"> = ((</span><span class="identifier">wchar_t</span><span class="plain"> *) (</span><span class="identifier">Memory::I7_calloc</span><span class="plain">(</span><span class="identifier">extent</span><span class="plain">, </span><span class="reserved">sizeof</span><span class="plain">(</span><span class="identifier">wchar_t</span><span class="plain">), </span><span class="constant">LEXER_TEXT_MREASON</span><span class="plain">)));</span>
|
|
<span class="identifier">lexer_workspace_allocated</span><span class="plain"> += </span><span class="identifier">extent</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_hwm</span><span class="plain"> = </span><span class="identifier">lexer_workspace</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_workspace_end</span><span class="plain"> = </span><span class="identifier">lexer_workspace</span><span class="plain"> + </span><span class="identifier">extent</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::ensure_lexer_hwm_can_be_raised_by is used in <a href="#SP14">§14</a>, <a href="#SP26">§26</a>.</p>
|
|
|
|
<p class="endnote">The function Lexer::allocate_lexer_workspace_chunk is used in <a href="#SP11">§11</a>.</p>
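<p class="inwebparagraph">The arithmetic for carrying a half-built word into a fresh chunk is easy to get wrong, so here is a self-contained sketch of it. The names and the tiny <code class="display"><span class="extract">CHUNK_SIZE</span></code> are assumptions for illustration. Unlike the routine above, the sketch also makes the very first allocation itself, and it measures free space by subtracting pointers rather than by adding to the high water mark.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define CHUNK_SIZE 64   /* hypothetical, and far smaller than a realistic chunk */

static wchar_t *workspace = NULL;      /* start of the current chunk */
static wchar_t *workspace_end = NULL;  /* one past its last character */
static wchar_t *hwm = NULL;            /* high water mark: next free character */
static wchar_t *partial_word = NULL;   /* start of the word being assembled */

static void allocate_chunk(int multiplier) {
    size_t extent = (size_t) multiplier * CHUNK_SIZE;
    workspace = calloc(extent, sizeof(wchar_t));
    assert(workspace != NULL);
    hwm = workspace;
    workspace_end = workspace + extent;
}

/* Guarantee room for n characters, a terminator and one character of lookahead. */
void ensure_room(int n, int transfer_partial_word) {
    if (workspace != NULL && workspace_end - hwm > n + 2) return;

    wchar_t *old_hwm = hwm;
    int carry = (workspace != NULL) && transfer_partial_word;
    int m = 1;
    /* The new chunk must hold a space, the partial word, and the n + 2 extra. */
    if (carry) m = ((int) (old_hwm - partial_word) + n + 3) / CHUNK_SIZE + 1;

    /* The old chunk is deliberately not freed: earlier words still point into it. */
    allocate_chunk(m);
    if (carry) {
        *(hwm++) = L' ';
        wchar_t *moved = hwm;
        while (partial_word < old_hwm) *(hwm++) = *(partial_word++);
        partial_word = moved;
    }
    assert(workspace_end - hwm > n + 2);
}
```

<p class="inwebparagraph">With a partial word of length p, the new chunk has more than p + n + 3 characters, and 1 + p of them are used by the transfer, so more than n + 2 remain and the final check holds. Without a transfer only a single chunk is allocated, so a request bigger than one chunk would fail that check.</p>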
|
|
|
|
<p class="inwebparagraph"><a id="SP14"></a><b>§14. </b>We occasionally want to reprocess the text of a word in higher-level
parsing, and it's convenient to use the lexer workspace to store the results
of such reprocessing. The following routine makes a persistent copy of its
argument there. Because it moves the high water mark, it should never be used
while the lexer is actually running.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="functiontext">Lexer::copy_to_memory</span><span class="plain">(</span><span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">p</span><span class="plain">) {</span>
|
|
<span class="functiontext">Lexer::ensure_lexer_hwm_can_be_raised_by</span><span class="plain">(</span><span class="identifier">Wide::len</span><span class="plain">(</span><span class="identifier">p</span><span class="plain">), </span><span class="identifier">FALSE</span><span class="plain">);</span>
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">q</span><span class="plain"> = </span><span class="identifier">lexer_hwm</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_hwm</span><span class="plain"> = </span><span class="identifier">q</span><span class="plain"> + </span><span class="identifier">Wide::len</span><span class="plain">(</span><span class="identifier">p</span><span class="plain">) + 1;</span>
|
|
<span class="identifier">wcscpy</span><span class="plain">(</span><span class="identifier">q</span><span class="plain">, </span><span class="identifier">p</span><span class="plain">);</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">q</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::copy_to_memory is used in 4/nw (<a href="4-nw.html#SP8">§8</a>).</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP15"></a><b>§15. External lexer states. </b>The lexer is a finite state machine at heart. Its current state is the
|
|
collective value of an extensive set of variables, almost all of them
|
|
flags, but with three exceptions this state is used only within the lexer.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">The three exceptional modes are all off by default, and they stay off
unless set from outside: the lexer never goes into any of them by itself.
|
|
</p>
|
|
|
|
<p class="inwebparagraph"><code class="display"><span class="extract">lexer_divide_strings_at_text_substitutions</span></code> is used by some of the lexical writing-back
|
|
machinery, when it has been decided to compile something like
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>say "[The noun] falls onto [the second noun]."</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">In its ordinary mode, with this setting off, the lexer will render this as
|
|
two words, the second being the entire quoted text. But if
|
|
<code class="display"><span class="extract">lexer_divide_strings_at_text_substitutions</span></code> is set then the text is reinterpreted as
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>say The noun, " falls onto ", the second noun, "."</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">which runs to eleven words, three of them commas (punctuation always counts
|
|
as a word).
|
|
</p>
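<p class="inwebparagraph">To make the rewriting concrete, here is a minimal sketch of the division applied to a single quoted text. It is not the lexer's own algorithm: the function name, the use of narrow <code class="display"><span class="extract">char</span></code> strings and the decision to drop empty literal pieces are all assumptions. Given the quoted text from the example above, it produces everything after the word "say" in the rewritten form.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite `quoted`, which must include its surrounding double quotes, into
   `out` as comma-separated pieces: literal runs keep quotes, substitutions
   in square brackets lose their brackets. */
void divide_at_substitutions(const char *quoted, char *out, size_t out_size) {
    size_t len = strlen(quoted);
    assert(out_size > 0 && len >= 2 && quoted[0] == '"' && quoted[len - 1] == '"');

    const char *p = quoted + 1, *end = quoted + len - 1;
    size_t used = 0;
    int pieces = 0;
    out[0] = '\0';

    while (p < end) {
        int is_subst = (*p == '[');
        const char *start = is_subst ? p + 1 : p;
        const char *stop = start;
        char terminator = is_subst ? ']' : '[';
        while (stop < end && *stop != terminator) stop++;

        int n = snprintf(out + used, out_size - used, "%s%s%.*s%s",
            (pieces > 0) ? ", " : "",        /* separator between pieces */
            is_subst ? "" : "\"",            /* opening quote for literals */
            (int) (stop - start), start,
            is_subst ? "" : "\"");           /* closing quote for literals */
        assert(n >= 0 && (size_t) n < out_size - used);
        used += (size_t) n;
        pieces++;

        /* Step over the closing bracket of a substitution, if there was one. */
        p = (is_subst && stop < end) ? stop + 1 : stop;
    }
}
```

<p class="inwebparagraph">A text with no substitutions comes back unchanged, quotes included, which matches the ordinary behaviour of treating the whole quoted text as one word.</p>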
|
|
|
|
<p class="inwebparagraph"><code class="display"><span class="extract">lexer_wait_for_dashes</span></code> is set by the extension-reading machinery, in
|
|
cases where it wants to get at the documentation text of an extension but
|
|
does not want to have to fill Inform's memory with the source text of its code.
|
|
In this mode, the lexer ignores the whole stream of words until it reaches
|
|
<code class="display"><span class="extract">----</span></code>, the special marker used in extensions to divide source text from
|
|
documentation: it then drops out of this mode and back into normal running,
|
|
so that subsequent words are lexed as usual.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">lexer_punctuation_marks</span><span class="plain"> = </span><span class="identifier">L</span><span class="string">""</span><span class="plain">;</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_divide_strings_at_text_substitutions</span><span class="plain">; </span> <span class="comment">Break up text substitutions in quoted text</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_allow_I6_escapes</span><span class="plain">; </span> <span class="comment">Recognise <code class="display"><span class="extract">(-</span></code> and <code class="display"><span class="extract">-)</span></code></span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_wait_for_dashes</span><span class="plain">; </span> <span class="comment">Ignore all text until first <code class="display"><span class="extract">----</span></code> found</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
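<p class="inwebparagraph">The effect of <code class="display"><span class="extract">lexer_wait_for_dashes</span></code> can be pictured as a filter over a stream of words. The sketch below is an illustration under assumptions, not the lexer's code: it works on words which have already been split, and it assumes that the marker itself is discarded along with everything before it.</p>

```c
#include <assert.h>
#include <string.h>

/* Copy into `kept` only the words which follow the first "----" marker.
   Returns the number of words kept. */
int words_after_dashes(const char **words, int count, const char **kept) {
    assert(words != NULL && kept != NULL);
    int waiting = 1;   /* plays the role of lexer_wait_for_dashes */
    int n = 0;
    for (int i = 0; i < count; i++) {
        if (waiting) {
            /* Finding the marker switches the mode off for good. */
            if (strcmp(words[i], "----") == 0) waiting = 0;
            continue;   /* the marker, and all words before it, are dropped */
        }
        kept[n++] = words[i];
    }
    return n;
}
```

<p class="inwebparagraph">Only the first marker is special here: once the mode is off, a later <code class="display"><span class="extract">----</span></code> is kept like any other word.</p>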
|
|
|
|
<p class="inwebparagraph"><a id="SP16"></a><b>§16. Definition of punctuation. </b>As we have seen, the question of whether something is a punctuation mark
|
|
or not depends slightly on the context:
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="functiontext">Lexer::is_punctuation</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">c</span><span class="plain">) {</span>
|
|
<span class="reserved">for</span><span class="plain"> (</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">i</span><span class="plain">=0; </span><span class="identifier">lexer_punctuation_marks</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]; </span><span class="identifier">i</span><span class="plain">++)</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="identifier">lexer_punctuation_marks</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">])</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">TRUE</span><span class="plain">;</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::is_punctuation is used in <a href="#SP25">§25</a>, 3/tff (<a href="3-tff.html#SP4">§4</a>).</p>
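<p class="inwebparagraph">The dependence on context comes entirely from which string <code class="display"><span class="extract">lexer_punctuation_marks</span></code> points to when the test is made. The following sketch shows a caller switching between two sets; the two sets themselves are hypothetical and chosen only for illustration.</p>

```c
#include <assert.h>
#include <wchar.h>

/* Two hypothetical punctuation sets, differing only in the slash. */
#define PLAIN_MARKS      L".,:;?!(){}[]"
#define MARKS_WITH_SLASH L".,:;?!(){}[]/"

static const wchar_t *punctuation_marks = L"";   /* empty: nothing is punctuation */

int is_punctuation(int c) {
    assert(punctuation_marks != NULL);
    for (int i = 0; punctuation_marks[i]; i++)
        if (c == (int) punctuation_marks[i])
            return 1;
    return 0;
}
```

<p class="inwebparagraph">An explicit loop is used rather than <code class="display"><span class="extract">wcschr</span></code>, since <code class="display"><span class="extract">wcschr</span></code> would also match the terminating null and so report character 0 as punctuation.</p>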
|
|
|
|
<p class="inwebparagraph"><a id="SP17"></a><b>§17. Definition of indentation. </b>We're going to record the level of indentation in the "break" character.
|
|
We will recognise anything from 1 to 25 tabs as distinct indentation amounts;
|
|
a value of 26 means "26 or more", and at such sizes, indentation isn't
|
|
distinguished. We'll do this with the letters <code class="display"><span class="extract">A</span></code> to <code class="display"><span class="extract">Z</span></code>.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="constant">GROSS_AMOUNT_OF_INDENTATION</span><span class="plain"> 26</span>
|
|
</pre>
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="functiontext">Lexer::indentation_level</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">q</span><span class="plain"> = </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_break</span><span class="plain"> - </span><span class="character">'A'</span><span class="plain"> + 1;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">q</span><span class="plain"> &gt;= 1) &amp;&amp; (</span><span class="identifier">q</span><span class="plain"> &lt;= </span><span class="constant">GROSS_AMOUNT_OF_INDENTATION</span><span class="plain">)) </span><span class="reserved">return</span><span class="plain"> </span><span class="identifier">q</span><span class="plain">;</span>
|
|
<span class="reserved">return</span><span class="plain"> 0;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="functiontext">Lexer::break_char_for_indents</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">t</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">t</span><span class="plain"> &lt;= 0) </span><span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"bad indentation break"</span><span class="plain">);</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">t</span><span class="plain"> >= 26) </span><span class="reserved">return</span><span class="plain"> </span><span class="character">'Z'</span><span class="plain">;</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="character">'A'</span><span class="plain"> + </span><span class="identifier">t</span><span class="plain"> - 1;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::indentation_level is used in 3/wrd (<a href="3-wrd.html#SP21">§21</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::break_char_for_indents is used in <a href="#SP26_2">§26.2</a>.</p>
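<p class="inwebparagraph">The encoding is simple enough to check as a round trip. This sketch restates the two functions in standalone form, taking the break character directly rather than a word number, and using <code class="display"><span class="extract">assert</span></code> in place of an internal error; it assumes a character set in which <code class="display"><span class="extract">A</span></code> to <code class="display"><span class="extract">Z</span></code> are contiguous, as in ASCII.</p>

```c
#include <assert.h>

#define MAX_INDENT 26   /* stands in for GROSS_AMOUNT_OF_INDENTATION */

/* Encode a tab count of 1 or more as a break character 'A' to 'Z'. */
int break_char_for_indents(int t) {
    assert(t > 0);
    if (t >= MAX_INDENT) return 'Z';   /* "26 or more" all look alike */
    return 'A' + t - 1;
}

/* Decode a break character: 0 means it records no indentation at all. */
int indentation_level(int break_char) {
    int q = break_char - 'A' + 1;
    if ((q >= 1) && (q <= MAX_INDENT)) return q;
    return 0;
}
```

<p class="inwebparagraph">Ordinary break characters such as a space or a newline fall outside the range <code class="display"><span class="extract">A</span></code> to <code class="display"><span class="extract">Z</span></code>, so they decode to level 0.</p>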
|
|
|
|
<p class="inwebparagraph"><a id="SP18"></a><b>§18. Access functions. </b></p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">vocabulary_entry</span><span class="plain"> *</span><span class="functiontext">Lexer::word</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_identity</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::set_word</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">, </span><span class="reserved">vocabulary_entry</span><span class="plain"> *</span><span class="identifier">ve</span><span class="plain">) {</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_identity</span><span class="plain"> = </span><span class="identifier">ve</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="functiontext">Lexer::break_before</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_break</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">source_file</span><span class="plain"> *</span><span class="functiontext">Lexer::file_of_origin</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_source.file_of_origin</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">source_location</span><span class="plain"> </span><span class="functiontext">Lexer::word_location</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">wn</span><span class="plain"> &lt; 0) {</span>
|
|
<span class="reserved">source_location</span><span class="plain"> </span><span class="identifier">nowhere</span><span class="plain">;</span>
|
|
<span class="identifier">nowhere</span><span class="element">.file_of_origin</span><span class="plain"> = </span><span class="identifier">NULL</span><span class="plain">;</span>
|
|
<span class="identifier">nowhere</span><span class="element">.line_number</span><span class="plain"> = 0;</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">nowhere</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_source</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::set_word_location</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">, </span><span class="reserved">source_location</span><span class="plain"> </span><span class="identifier">sl</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">wn</span><span class="plain"> &lt; 0) </span><span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"can't set word location"</span><span class="plain">);</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_source</span><span class="plain"> = </span><span class="identifier">sl</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="functiontext">Lexer::word_raw_text</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::set_word_raw_text</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">, </span><span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">rt</span><span class="plain">) {</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain"> = </span><span class="identifier">rt</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="identifier">wchar_t</span><span class="plain"> *</span><span class="functiontext">Lexer::word_text</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::set_word_text</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">, </span><span class="identifier">wchar_t</span><span class="plain"> *</span><span class="identifier">rt</span><span class="plain">) {</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain"> = </span><span class="identifier">rt</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::word_copy</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">to</span><span class="plain">, </span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">from</span><span class="plain">) {</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">to</span><span class="plain">] = </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">from</span><span class="plain">];</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::writer</span><span class="plain">(</span><span class="identifier">OUTPUT_STREAM</span><span class="plain">, </span><span class="reserved">char</span><span class="plain"> *</span><span class="identifier">format_string</span><span class="plain">, </span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">wn</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">wn</span><span class="plain"> &lt; 0) || (</span><span class="identifier">wn</span><span class="plain"> &gt;= </span><span class="identifier">lexer_wordcount</span><span class="plain">)) </span><span class="reserved">return</span><span class="plain">;</span>
|
|
<span class="reserved">switch</span><span class="plain"> (</span><span class="identifier">format_string</span><span class="plain">[0]) {</span>
|
|
<span class="reserved">case</span><span class="plain"> </span><span class="character">'+'</span><span class="plain">: </span><span class="identifier">WRITE</span><span class="plain">(</span><span class="string">"%w"</span><span class="plain">, </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain">); </span><span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="reserved">case</span><span class="plain"> </span><span class="character">'~'</span><span class="plain">:</span>
|
|
<span class="functiontext">Word::compile_to_I6_dictionary</span><span class="plain">(</span><span class="identifier">OUT</span><span class="plain">, </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain">, </span><span class="identifier">FALSE</span><span class="plain">);</span>
|
|
<span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="reserved">case</span><span class="plain"> </span><span class="character">'&lt;'</span><span class="plain">:</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">STREAM_USES_UTF8</span><span class="plain">(</span><span class="identifier">OUT</span><span class="plain">)) </span><span class="identifier">Streams::enable_XML_escapes</span><span class="plain">(</span><span class="identifier">OUT</span><span class="plain">);</span>
|
|
<span class="identifier">WRITE</span><span class="plain">(</span><span class="string">"%w"</span><span class="plain">, </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain">);</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">STREAM_USES_UTF8</span><span class="plain">(</span><span class="identifier">OUT</span><span class="plain">)) </span><span class="identifier">Streams::disable_XML_escapes</span><span class="plain">(</span><span class="identifier">OUT</span><span class="plain">);</span>
|
|
<span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="reserved">case</span><span class="plain"> </span><span class="character">'N'</span><span class="plain">: </span><span class="identifier">WRITE</span><span class="plain">(</span><span class="string">"%w"</span><span class="plain">, </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">wn</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain">); </span><span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="reserved">default</span><span class="plain">: </span><span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"bad %N extension"</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::word is used in 2/vcb (<a href="2-vcb.html#SP7">§7</a>, <a href="2-vcb.html#SP11">§11</a>, <a href="2-vcb.html#SP12">§12</a>, <a href="2-vcb.html#SP17">§17</a>), 2/wa (<a href="2-wa.html#SP4">§4</a>, <a href="2-wa.html#SP9">§9</a>, <a href="2-wa.html#SP10">§10</a>, <a href="2-wa.html#SP11">§11</a>), 3/wrd (<a href="3-wrd.html#SP8">§8</a>, <a href="3-wrd.html#SP18">§18</a>, <a href="3-wrd.html#SP19">§19</a>, <a href="3-wrd.html#SP20">§20</a>), 3/tff (<a href="3-tff.html#SP4">§4</a>), 4/nw (<a href="4-nw.html#SP1">§1</a>), 4/prf (<a href="4-prf.html#SP21">§21</a>, <a href="4-prf.html#SP26">§26</a>, <a href="4-prf.html#SP26_2">§26.2</a>, <a href="4-prf.html#SP26_3">§26.3</a>, <a href="4-prf.html#SP28_1_1">§28.1.1</a>, <a href="4-prf.html#SP28_1_3">§28.1.3</a>, <a href="4-prf.html#SP29_2">§29.2</a>, <a href="4-prf.html#SP34">§34</a>, <a href="4-prf.html#SP35">§35</a>, <a href="4-prf.html#SP50_2_1_2_3_3_3">§50.2.1.2.3.3.3</a>, <a href="4-prf.html#SP52">§52</a>), 4/bn (<a href="4-bn.html#SP5">§5</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::set_word is used in 2/vcb (<a href="2-vcb.html#SP4">§4</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::break_before is used in 3/wrd (<a href="3-wrd.html#SP21">§21</a>), 4/nw (<a href="4-nw.html#SP4">§4</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::file_of_origin appears nowhere else.</p>
|
|
|
|
<p class="endnote">The function Lexer::word_location is used in 3/wrd (<a href="3-wrd.html#SP11">§11</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::set_word_location appears nowhere else.</p>
|
|
|
|
<p class="endnote">The function Lexer::word_raw_text is used in 2/vcb (<a href="2-vcb.html#SP4">§4</a>), 3/wrd (<a href="3-wrd.html#SP16">§16</a>, <a href="3-wrd.html#SP22_3">§22.3</a>, <a href="3-wrd.html#SP22_4">§22.4</a>), 4/nw (<a href="4-nw.html#SP1">§1</a>, <a href="4-nw.html#SP2">§2</a>, <a href="4-nw.html#SP5">§5</a>, <a href="4-nw.html#SP6">§6</a>, <a href="4-nw.html#SP7">§7</a>, <a href="4-nw.html#SP8">§8</a>), 4/prf (<a href="4-prf.html#SP29_1">§29.1</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::set_word_raw_text is used in 2/vcb (<a href="2-vcb.html#SP5">§5</a>), 4/nw (<a href="4-nw.html#SP8">§8</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::word_text is used in 2/vcb (<a href="2-vcb.html#SP4">§4</a>, <a href="2-vcb.html#SP7">§7</a>), 3/wrd (<a href="3-wrd.html#SP17">§17</a>), 3/tff (<a href="3-tff.html#SP4">§4</a>), 3/idn (<a href="3-idn.html#SP3">§3</a>), 4/nw (<a href="4-nw.html#SP2">§2</a>, <a href="4-nw.html#SP8">§8</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::set_word_text is used in 2/vcb (<a href="2-vcb.html#SP5">§5</a>), 4/nw (<a href="4-nw.html#SP8">§8</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::word_copy is used in <a href="#SP28">§28</a>.</p>
|
|
|
|
<p class="endnote">The function Lexer::writer is used in 1/wm (<a href="1-wm.html#SP3_2">§3.2</a>).</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP19"></a><b>§19. Definition of white space. </b>The following macro (to save time over a function call) is highly dangerous,
|
|
and of the kind which all books on C counsel against. If it were called with
|
|
any argument whose evaluation had side-effects, disaster would ensue.
|
|
It is therefore used only twice, with care, and only in this section below.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="identifier">is_whitespace</span><span class="plain">(</span><span class="identifier">c</span><span class="plain">) ((</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">' '</span><span class="plain">) || (</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) || (</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">t</span><span class="character">'</span><span class="plain">))</span>
|
|
</pre>
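<p class="inwebparagraph">To make the danger concrete, here is a minimal sketch, not part of the lexer itself. The macro is copied from the definition above; <code class="display"><span class="extract">is_whitespace_fn</span></code>, <code class="display"><span class="extract">next_char</span></code> and the two counting functions are hypothetical names introduced only for this illustration. Because the macro expands its argument up to three times, an argument with a side-effect may be evaluated up to three times.
</p>

```c
#include <assert.h>

/* The macro as defined in this section: its argument is expanded up to three times. */
#define is_whitespace(c) ((c == ' ') || (c == '\n') || (c == '\t'))

/* Hypothetical function equivalent, which evaluates its argument exactly once. */
static int is_whitespace_fn(int c) {
    return (c == ' ') || (c == '\n') || (c == '\t');
}

/* Hypothetical reader with a side-effect: each call consumes one character. */
static const char *cursor = "";
static int reads = 0;
static int next_char(void) {
    reads++;
    return (unsigned char) *cursor++;
}

/* How many characters does a single test consume when made through the macro? */
int reads_via_macro(const char *text) {
    cursor = text; reads = 0;
    (void) is_whitespace(next_char());
    return reads;
}

/* And how many when made through the function? */
int reads_via_function(const char *text) {
    cursor = text; reads = 0;
    (void) is_whitespace_fn(next_char());
    return reads;
}
```

<p class="inwebparagraph">On the text <code class="display"><span class="extract">x y</span></code> the macro consumes three characters to answer one question, where the function consumes one. The text passed in should be at least three characters long, so that the macro version never reads past the terminator.
</p>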
|
|
<p class="inwebparagraph"><a id="SP20"></a><b>§20. Internal lexer states. </b>The current situation of the lexer is specified by the collective values
|
|
of all of the following. First, the start of the current word being
|
|
recorded, and the current high water mark — those are defined above.
|
|
Second, we need the feeder machinery to maintain a variable telling us
|
|
the previous character in the raw, un-respaced source. We need to be a
|
|
little careful about the type of this: it needs to be an <code class="display"><span class="extract">int</span></code> so that it
|
|
can on occasion hold the pseudo-character value <code class="display"><span class="extract">EOF</span></code>.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_previous_char_in_raw_feed</span><span class="plain">; </span> <span class="comment">Preceding character in raw file read</span>
|
|
</pre>
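<p class="inwebparagraph">The point about the type can be shown with a small sketch, under the assumption that characters arrive as bytes in the range 0 to 255. The function <code class="display"><span class="extract">last_char_seen</span></code> is hypothetical and exists only for this illustration: it keeps its "previous character" in an <code class="display"><span class="extract">int</span></code>, so that the byte 0xFF and the pseudo-character <code class="display"><span class="extract">EOF</span></code> remain distinct.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* for EOF */

/* Returns the last byte of the buffer as a value in 0..255, or EOF if the
   buffer is empty. Holding the value in an int keeps 0xFF apart from EOF. */
int last_char_seen(const unsigned char *buf, size_t len) {
    int previous = EOF;               /* nothing has been read yet */
    for (size_t i = 0; i < len; i++)
        previous = buf[i];            /* promoted to int, so never negative */
    return previous;
}
```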
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="inwebparagraph"><a id="SP21"></a><b>§21. </b>There are four kinds of word: ordinary words, [comments in square brackets],
|
|
"strings in double quotes," and <code class="display"><span class="extract">(- I6_inclusion_text -)</span></code>. The latter
|
|
three kinds are collectively called literals. As each word is read,
|
|
the variable <code class="display"><span class="extract">lxs_kind_of_word</span></code> holds what it is currently believed to be.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="constant">ORDINARY_KW</span><span class="plain"> 0</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">COMMENT_KW</span><span class="plain"> 1</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">STRING_KW</span><span class="plain"> 2</span>
|
|
<span class="definitionkeyword">define</span> <span class="constant">I6_INCLUSION_KW</span><span class="plain"> 3</span>
|
|
</pre>
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_kind_of_word</span><span class="plain">; </span> <span class="comment">One of the defined values above</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="inwebparagraph"><a id="SP22"></a><b>§22. </b>While there are a pile of state variables below, the basic situation is that
|
|
the lexer has two main modes: ordinary mode and literal mode, determined
|
|
by whether <code class="display"><span class="extract">lxs_literal_mode</span></code> is false or true. It might look as if this
|
|
variable is redundant — can't we simply see whether <code class="display"><span class="extract">lxs_kind_of_word</span></code>
|
|
is <code class="display"><span class="extract">ORDINARY_KW</span></code> or not? — but in fact we return to ordinary mode slightly
|
|
before we finish recording a literal, as we shall see, so it is important
|
|
to be able to switch in and out of literal mode without changing the kind
|
|
of word.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_literal_mode</span><span class="plain">; </span> <span class="comment">Are we in literal or ordinary mode?</span>
|
|
|
|
<span class="comment">significant in ordinary mode:</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_most_significant_space_char</span><span class="plain">; </span> <span class="comment">Most significant whitespace character preceding</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_number_of_tab_stops</span><span class="plain">; </span> <span class="comment">Number of consecutive tabs</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_this_line_is_empty_so_far</span><span class="plain">; </span> <span class="comment">Current line white space so far?</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_this_word_is_empty_so_far</span><span class="plain">; </span> <span class="comment">Looking for a word to start?</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_scanning_text_substitution</span><span class="plain">; </span> <span class="comment">Used to break up strings at [substitutions]</span>
|
|
|
|
<span class="comment">significant in literal mode:</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_comment_nesting</span><span class="plain">; </span> <span class="comment">For square brackets within square brackets</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lxs_string_soak_up_spaces_mode</span><span class="plain">; </span> <span class="comment">Used to fold strings which break across lines</span>
|
|
</pre>
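<p class="inwebparagraph">As an illustration of why a nesting count is needed for comments, here is a sketch of finding the end of a bracketed comment. It is an assumption about the general technique, not a copy of the lexer's own code, which works one character at a time; the function name <code class="display"><span class="extract">end_of_comment</span></code> is hypothetical, and its local variable <code class="display"><span class="extract">nesting</span></code> plays the role of <code class="display"><span class="extract">lxs_comment_nesting</span></code>.
</p>

```c
#include <assert.h>

/* Returns the index just past the ']' which matches the '[' at text[0],
   or -1 if the text does not begin with '[' or the comment never ends. */
int end_of_comment(const char *text) {
    if (text[0] != '[') return -1;
    int nesting = 0;                  /* depth of square brackets opened so far */
    for (int i = 0; text[i] != '\0'; i++) {
        if (text[i] == '[') nesting++;
        if (text[i] == ']') {
            nesting--;
            if (nesting == 0) return i + 1;   /* outermost bracket now closed */
        }
    }
    return -1;                        /* ran out of text while still inside */
}
```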
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="inwebparagraph"><a id="SP23"></a><b>§23. </b>The lexer needs to be reset each time it is used on a given feed of text,
|
|
whether from a file or internally. Note that this resets both external
|
|
and internal states to their defaults (the default for external states
|
|
always being "off").
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::reset_lexer</span><span class="plain">(</span><span class="reserved">void</span><span class="plain">) {</span>
|
|
<span class="identifier">lexer_word</span><span class="plain"> = </span><span class="identifier">lexer_hwm</span><span class="plain">;</span>
|
|
<span class="identifier">lxs_previous_char_in_raw_feed</span><span class="plain"> = </span><span class="identifier">EOF</span><span class="plain">;</span>
|
|
|
|
<span class="comment">reset the external states</span>
|
|
<span class="identifier">lexer_wait_for_dashes</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_punctuation_marks</span><span class="plain"> = </span><span class="constant">STANDARD_PUNCTUATION_MARKS</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_divide_strings_at_text_substitutions</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_allow_I6_escapes</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">;</span>
|
|
|
|
<span class="comment">reset the internal states</span>
|
|
<span class="identifier">lxs_most_significant_space_char</span><span class="plain"> = </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">; </span> <span class="comment">we imagine each lexer feed starting a new line</span>
|
|
<span class="identifier">lxs_number_of_tab_stops</span><span class="plain"> = 0; </span> <span class="comment">but not yet indented with tabs</span>
|
|
|
|
<span class="identifier">lxs_this_line_is_empty_so_far</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">; </span> <span class="comment">clearly</span>
|
|
<span class="identifier">lxs_this_word_is_empty_so_far</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">; </span> <span class="comment">likewise</span>
|
|
|
|
<span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">; </span> <span class="comment">begin in ordinary mode...</span>
|
|
<span class="identifier">lxs_kind_of_word</span><span class="plain"> = </span><span class="constant">ORDINARY_KW</span><span class="plain">; </span> <span class="comment">...expecting an ordinary word</span>
|
|
<span class="identifier">lxs_string_soak_up_spaces_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="identifier">lxs_scanning_text_substitution</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="identifier">lxs_comment_nesting</span><span class="plain"> = 0;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::reset_lexer is used in <a href="#SP24">§24</a>.</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP24"></a><b>§24. Feeding the lexer. </b>The lexer takes its input as a stream of characters, sent from a "feeder
|
|
routine": there are two of these, one sending the stream from a file, the
|
|
other from a C string. A feeder routine is required to:
|
|
</p>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<ul class="items"><li>(1) call <code class="display"><span class="extract">Lexer::feed_begins</span></code> before sending the first character,
|
|
</li></ul>
|
|
<ul class="items"><li>(2) send ISO Latin-1 characters which also exist in ZSCII, in sequence,
|
|
via <code class="display"><span class="extract">Lexer::feed_triplet</span></code>,
|
|
</li></ul>
|
|
<ul class="items"><li>(3) conclude by calling <code class="display"><span class="extract">Lexer::feed_ends</span></code>.
|
|
</li></ul>
|
|
<p class="inwebparagraph">Only one feeder can be active at a time, as the following routines ensure.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lexer_feed_started_at</span><span class="plain"> = -1;</span>
|
|
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::feed_begins</span><span class="plain">(</span><span class="reserved">source_location</span><span class="plain"> </span><span class="identifier">sl</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_feed_started_at</span><span class="plain"> >= 0) </span><span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"one lexer feeder interrupted another"</span><span class="plain">);</span>
|
|
<span class="identifier">lexer_feed_started_at</span><span class="plain"> = </span><span class="identifier">lexer_wordcount</span><span class="plain">;</span>
|
|
<span class="identifier">lexer_position</span><span class="plain"> = </span><span class="identifier">sl</span><span class="plain">;</span>
|
|
<span class="functiontext">Lexer::reset_lexer</span><span class="plain">();</span>
|
|
<span class="identifier">LOGIF</span><span class="plain">(</span><span class="identifier">LEXICAL_OUTPUT</span><span class="plain">, </span><span class="string">"Lexer feed began at %d\</span><span class="plain">n</span><span class="string">"</span><span class="plain">, </span><span class="identifier">lexer_feed_started_at</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">wording</span><span class="plain"> </span><span class="functiontext">Lexer::feed_ends</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">extra_padding</span><span class="plain">, </span><span class="identifier">text_stream</span><span class="plain"> *</span><span class="identifier">problem_source_description</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_feed_started_at</span><span class="plain"> == -1) </span><span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"lexer feeder ended without starting"</span><span class="plain">);</span>
|
|
|
|
<<span class="cwebmacro">Feed whitespace as padding</span> <span class="cwebmacronumber">24.1</span>><span class="plain">;</span>
|
|
|
|
<span class="reserved">wording</span><span class="plain"> </span><span class="identifier">RRW</span><span class="plain"> = </span><span class="constant">EMPTY_WORDING</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_feed_started_at</span><span class="plain"> < </span><span class="identifier">lexer_wordcount</span><span class="plain">)</span>
|
|
<span class="identifier">RRW</span><span class="plain"> = </span><span class="functiontext">Wordings::new</span><span class="plain">(</span><span class="identifier">lexer_feed_started_at</span><span class="plain">, </span><span class="identifier">lexer_wordcount</span><span class="plain">-1);</span>
|
|
<span class="identifier">lexer_feed_started_at</span><span class="plain"> = -1;</span>
|
|
<span class="identifier">LOGIF</span><span class="plain">(</span><span class="identifier">LEXICAL_OUTPUT</span><span class="plain">, </span><span class="string">"Lexer feed ended at %d\</span><span class="plain">n</span><span class="string">"</span><span class="plain">, </span><span class="functiontext">Wordings::first_wn</span><span class="plain">(</span><span class="identifier">RRW</span><span class="plain">));</span>
|
|
<<span class="cwebmacro">Issue Problem messages if feed ended in the middle of quoted text, comment or verbatim I6</span> <span class="cwebmacronumber">24.3</span>><span class="plain">;</span>
|
|
<span class="reserved">return</span><span class="plain"> </span><span class="identifier">RRW</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::feed_begins is used in 3/tff (<a href="3-tff.html#SP2">§2</a>), 3/fds (<a href="3-fds.html#SP5">§5</a>).</p>
|
|
|
|
<p class="endnote">The function Lexer::feed_ends is used in 3/tff (<a href="3-tff.html#SP2">§2</a>), 3/fds (<a href="3-fds.html#SP5">§5</a>).</p>
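<p class="inwebparagraph">The three obligations on a feeder can be sketched as follows. This is a simplified illustration only: the <code class="display"><span class="extract">stub_</span></code> functions are hypothetical stand-ins which merely record what they are sent, and they omit the source location, padding flag and wording return value which the real <code class="display"><span class="extract">Lexer::feed_begins</span></code> and <code class="display"><span class="extract">Lexer::feed_ends</span></code> use. The name <code class="display"><span class="extract">feed_c_string</span></code> is likewise an assumption.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* for EOF */
#include <string.h>

/* Hypothetical stand-ins for the lexer's entry points. */
static int feed_active = 0;
static char received[256];
static size_t received_len = 0;
static int first_last_cr = 0, final_next_cr = 0;

static void stub_feed_begins(void) {
    assert(!feed_active);             /* one feeder must not interrupt another */
    feed_active = 1;
    received_len = 0;
}

static void stub_feed_triplet(int last_cr, int cr, int next_cr) {
    assert(feed_active && cr != EOF); /* the middle character is never EOF */
    if (received_len == 0) first_last_cr = last_cr;
    final_next_cr = next_cr;
    if (received_len < sizeof(received) - 1) received[received_len++] = (char) cr;
}

static void stub_feed_ends(void) {
    assert(feed_active);              /* a feed cannot end without starting */
    feed_active = 0;
    received[received_len] = '\0';
}

/* A feeder for C strings which follows steps (1) to (3) in order. */
const char *feed_c_string(const char *text) {
    stub_feed_begins();                                   /* step (1) */
    int last_cr = EOF;                                    /* no previous character yet */
    for (size_t i = 0; text[i] != '\0'; i++) {
        int cr = (unsigned char) text[i];
        int next_cr = (text[i+1] != '\0') ? (unsigned char) text[i+1] : EOF;
        stub_feed_triplet(last_cr, cr, next_cr);          /* step (2) */
        last_cr = cr;
    }
    stub_feed_ends();                                     /* step (3) */
    return received;
}
```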
|
|
|
|
<p class="inwebparagraph"><a id="SP24_1"></a><b>§24.1. </b>White space padding guarantees that a word running right up to the end of
|
|
the feed will be processed, since (outside literal mode) that white space
|
|
signals to the lexer that a word is complete. (If we are in literal mode at
|
|
the end of the feed, problem messages are produced. We code NI to ensure
|
|
that this never occurs when feeding our own C strings through.)
|
|
</p>
|
|
|
|
<p class="inwebparagraph">At the end of each complete file, we also want to ensure there is always a
|
|
paragraph break, because this simplifies the parsing of headings (which in
|
|
turn is because a file boundary counts as a super-heading-break, and headings
|
|
are only detected as stand-alone paragraphs). We add a bit more white
|
|
space than is strictly necessary, because it saves worrying about whether
|
|
it is safe to look ahead to characters further on in the lexer's workspace
|
|
when we are close to the high water mark, and because it means that a source
|
|
file which is empty or contains only a byte-order marker comes out as at
|
|
least one paragraph, even if a blank one.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Feed whitespace as padding</span> <span class="cwebmacronumber">24.1</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">extra_padding</span><span class="plain"> == </span><span class="identifier">FALSE</span><span class="plain">) {</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> {</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP24">§24</a>.</p>
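<p class="inwebparagraph">The effect of the padding can be seen in a toy model, offered as a sketch under a simplifying assumption: that a word is registered only when white space arrives after it. The functions <code class="display"><span class="extract">feed_char</span></code> and <code class="display"><span class="extract">count_words</span></code> are hypothetical and much cruder than the real lexer.
</p>

```c
#include <assert.h>

/* Toy state: are we inside a word, and how many words have been completed? */
static int in_word = 0, words_completed = 0;

/* A word is completed only when white space follows it. */
static void feed_char(int c) {
    if (c == ' ' || c == '\n' || c == '\t') {
        if (in_word) words_completed++;
        in_word = 0;
    } else {
        in_word = 1;
    }
}

/* Counts the words registered in text, optionally feeding a padding space at the end. */
int count_words(const char *text, int pad) {
    in_word = 0; words_completed = 0;
    for (int i = 0; text[i] != '\0'; i++) feed_char((unsigned char) text[i]);
    if (pad) feed_char(' ');
    return words_completed;
}
```

<p class="inwebparagraph">Without padding, the final word of <code class="display"><span class="extract">the end</span></code> is never registered; with a single padding space, it is.
</p>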
|
|
|
|
<p class="inwebparagraph"><a id="SP24_2"></a><b>§24.2. </b>These problem messages can, of course, never result from text which NI
|
|
is feeding into the lexer itself, independently of source files. That would
|
|
be a bug, and NI is bug-free, so it follows that it could never happen.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">enum</span> <span class="constant">MEMORY_OUT_LEXERERROR</span><span class="definitionkeyword"> from </span><span class="constant">0</span>
|
|
<span class="definitionkeyword">enum</span> <span class="constant">STRING_NEVER_ENDS_LEXERERROR</span>
|
|
<span class="definitionkeyword">enum</span> <span class="constant">COMMENT_NEVER_ENDS_LEXERERROR</span>
|
|
<span class="definitionkeyword">enum</span> <span class="constant">I6_NEVER_ENDS_LEXERERROR</span>
|
|
</pre>
|
|
<p class="inwebparagraph"><a id="SP24_3"></a><b>§24.3. </b><code class="display">
|
|
<<span class="cwebmacrodefn">Issue Problem messages if feed ended in the middle of quoted text, comment or verbatim I6</span> <span class="cwebmacronumber">24.3</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> != </span><span class="constant">ORDINARY_KW</span><span class="plain">) {</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_wordcount</span><span class="plain"> >= 20) {</span>
|
|
<span class="identifier">LOG</span><span class="plain">(</span><span class="string">"Last words: %W\</span><span class="plain">n</span><span class="string">"</span><span class="plain">, </span><span class="functiontext">Wordings::new</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain">-20, </span><span class="identifier">lexer_wordcount</span><span class="plain">-1));</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> </span><span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_wordcount</span><span class="plain"> >= 1) {</span>
|
|
<span class="identifier">LOG</span><span class="plain">(</span><span class="string">"Last words: %W\</span><span class="plain">n</span><span class="string">"</span><span class="plain">, </span><span class="functiontext">Wordings::new</span><span class="plain">(0, </span><span class="identifier">lexer_wordcount</span><span class="plain">-1));</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> {</span>
|
|
<span class="identifier">LOG</span><span class="plain">(</span><span class="string">"No words recorded\</span><span class="plain">n</span><span class="string">"</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">STRING_KW</span><span class="plain">)</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">STRING_NEVER_ENDS_LEXERERROR</span><span class="plain">, </span><span class="identifier">problem_source_description</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">);</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">COMMENT_KW</span><span class="plain">)</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">COMMENT_NEVER_ENDS_LEXERERROR</span><span class="plain">, </span><span class="identifier">problem_source_description</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">);</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">I6_INCLUSION_KW</span><span class="plain">)</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">I6_NEVER_ENDS_LEXERERROR</span><span class="plain">, </span><span class="identifier">problem_source_description</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">);</span>
|
|
<span class="identifier">lxs_kind_of_word</span><span class="plain"> = </span><span class="constant">ORDINARY_KW</span><span class="plain">;</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP24">§24</a>.</p>
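<p class="inwebparagraph">The two decisions made above can be restated in isolation as a sketch. The constants below are plain C enumerations which mirror the definitions in this section; the function names <code class="display"><span class="extract">error_for_unfinished</span></code> and <code class="display"><span class="extract">first_logged_word</span></code> are hypothetical, and the value -1 for "no error" is an assumption made for the illustration.
</p>

```c
#include <assert.h>

enum { ORDINARY_KW, COMMENT_KW, STRING_KW, I6_INCLUSION_KW };
enum { MEMORY_OUT_LEXERERROR, STRING_NEVER_ENDS_LEXERERROR,
       COMMENT_NEVER_ENDS_LEXERERROR, I6_NEVER_ENDS_LEXERERROR };

/* Which error should be reported if the feed ends inside this kind of word?
   Returns -1 (an assumption of this sketch) when no error is called for. */
int error_for_unfinished(int kind_of_word) {
    switch (kind_of_word) {
        case STRING_KW:       return STRING_NEVER_ENDS_LEXERERROR;
        case COMMENT_KW:      return COMMENT_NEVER_ENDS_LEXERERROR;
        case I6_INCLUSION_KW: return I6_NEVER_ENDS_LEXERERROR;
        default:              return -1;
    }
}

/* First word number of the logged excerpt: at most limit words, never before word 0. */
int first_logged_word(int wordcount, int limit) {
    return (wordcount >= limit) ? wordcount - limit : 0;
}
```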
|
|
|
|
<p class="inwebparagraph"><a id="SP25"></a><b>§25. </b>The feeder routine is required to send us a triple each time: <code class="display"><span class="extract">cr</span></code>
|
|
must be a valid character (see above) and may not be <code class="display"><span class="extract">EOF</span></code>; <code class="display"><span class="extract">last_cr</span></code> must
|
|
be the previous one or else perhaps <code class="display"><span class="extract">EOF</span></code> at the start of feed;
|
|
while <code class="display"><span class="extract">next_cr</span></code> must be the next or else perhaps <code class="display"><span class="extract">EOF</span></code> at the end of feed.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">Spaces, often redundant, are inserted around punctuation unless one of the
|
|
following exceptions holds:
|
|
</p>
|
|
|
|
<p class="inwebparagraph">(a) The lexer is in literal mode (inside strings, for instance);
|
|
</p>
|
|
|
|
<p class="inwebparagraph">(b) Where a single punctuation mark occurs between two digits, or between
|
|
a digit and a minus sign, or (in the case of full stops) between two lower-case
|
|
alphanumeric characters. This is done so that, for instance, "0.91" does
|
|
not split into three words in the lexer. We do not count square brackets
|
|
here, because if we did, that would cause trouble in parsing
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>say "[if M is less than 10]0[otherwise]1";</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">where the <code class="display"><span class="extract">0]0</span></code> would go unbroken in <code class="display"><span class="extract">lexer_divide_strings_at_text_substitutions</span></code>
|
|
mode, and therefore the <code class="display"><span class="extract">]</span></code> would remain glued to the preceding text;
|
|
</p>
|
|
|
|
<p class="inwebparagraph">(c) Where the character following is a slash. (This is done essentially to make
|
|
most common URLs glue up as single words.)
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::feed_triplet</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">last_cr</span><span class="plain">, </span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">cr</span><span class="plain">, </span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">next_cr</span><span class="plain">) {</span>
|
|
<span class="identifier">lxs_previous_char_in_raw_feed</span><span class="plain"> = </span><span class="identifier">last_cr</span><span class="plain">;</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">space</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="functiontext">Lexer::is_punctuation</span><span class="plain">(</span><span class="identifier">cr</span><span class="plain">)) </span><span class="identifier">space</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">space</span><span class="plain">) && (</span><span class="identifier">lxs_literal_mode</span><span class="plain">)) </span><span class="identifier">space</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">space</span><span class="plain">) && (</span><span class="identifier">cr</span><span class="plain"> != </span><span class="character">'['</span><span class="plain">) && (</span><span class="identifier">cr</span><span class="plain"> != </span><span class="character">']'</span><span class="plain">)) {</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">space</span><span class="plain">) && (</span><span class="identifier">next_cr</span><span class="plain"> == </span><span class="character">'/'</span><span class="plain">)) </span><span class="identifier">space</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">space</span><span class="plain">) {</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">lc</span><span class="plain"> = 0, </span><span class="identifier">nc</span><span class="plain"> = 0;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">Characters::isdigit</span><span class="plain">(</span><span class="identifier">last_cr</span><span class="plain">)) </span><span class="identifier">lc</span><span class="plain"> = 1;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">last_cr</span><span class="plain"> >= </span><span class="character">'a'</span><span class="plain">) && (</span><span class="identifier">last_cr</span><span class="plain"> <= </span><span class="character">'z'</span><span class="plain">)) </span><span class="identifier">lc</span><span class="plain"> = 2;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">Characters::isdigit</span><span class="plain">(</span><span class="identifier">next_cr</span><span class="plain">)) </span><span class="identifier">nc</span><span class="plain"> = 1;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">next_cr</span><span class="plain"> == </span><span class="character">'-'</span><span class="plain">) </span><span class="identifier">nc</span><span class="plain"> = 1;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">next_cr</span><span class="plain"> >= </span><span class="character">'a'</span><span class="plain">) && (</span><span class="identifier">next_cr</span><span class="plain"> <= </span><span class="character">'z'</span><span class="plain">)) </span><span class="identifier">nc</span><span class="plain"> = 2;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lc</span><span class="plain"> == 1) && (</span><span class="identifier">nc</span><span class="plain"> == 1)) </span><span class="identifier">space</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">cr</span><span class="plain"> == </span><span class="character">'.'</span><span class="plain">) && (</span><span class="identifier">lc</span><span class="plain"> > 0) && (</span><span class="identifier">nc</span><span class="plain"> > 0)) </span><span class="identifier">space</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">space</span><span class="plain">) {</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="identifier">cr</span><span class="plain">); </span> <span class="comment">which might take us into literal mode, so to be careful...</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_literal_mode</span><span class="plain"> == </span><span class="identifier">FALSE</span><span class="plain">) </span><span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> </span><span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="identifier">cr</span><span class="plain">);</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">cr</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) && (</span><span class="identifier">lexer_position</span><span class="element">.file_of_origin</span><span class="plain">))</span>
|
|
<span class="identifier">lexer_position</span><span class="element">.line_number</span><span class="plain">++;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::feed_triplet is used in 3/tff (<a href="3-tff.html#SP2">§2</a>), 3/fds (<a href="3-fds.html#SP5">§5</a>).</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP26"></a><b>§26. Lexing one character at a time. </b>We can think of characters as a stream of differently-coloured marbles,
|
|
flowing from various sources into a hopper above our marble-sorting
|
|
machine. The hopper lets the marbles drop through one at a time into the
|
|
mechanism below, but inserts transparent glass marbles of its own on either
|
|
side of certain colours of marble, so that the sequence of marbles entering
|
|
the mechanism is no longer the same as that which entered the hopper.
|
|
Moreover, the mechanism can itself cause extra marbles of its choice to
|
|
drop in from time to time, further interrupting the original flow.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">The following routine is the mechanism which receives the marbles. We want
|
|
the marbles to run swiftly through and either be pulverised to glass
|
|
powder, or dropped into the output bucket, as the mechanism chooses.
|
|
(Whatever marbles from the original source survive will always emerge in
|
|
their original order, though.) Every so often the mechanism decides that it
|
|
has completed one batch, and moves on to dropping marbles into the next
|
|
bucket.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">The marbles are characters; transparent glass ones are whitespace, which
|
|
will always now be <code class="display"><span class="extract">' '</span></code>, <code class="display"><span class="extract">'\t'</span></code> or <code class="display"><span class="extract">'\n'</span></code>; the routine
|
|
<code class="display"><span class="extract">Lexer::feed_triplet</span></code> above was the hopper; the routine
|
|
<code class="display"><span class="extract">Lexer::feed_char_into_lexer</span></code>, which occupies the whole of the rest of this
|
|
section, is the mechanism which takes each marble in turn. (On occasion it
|
|
calls itself recursively to cause extra characters of its choice to drop
|
|
in.) The batches are words, and the bucket receiving the surviving marbles
|
|
is the sequence of characters starting at <code class="display"><span class="extract">lexer_word</span></code> and extending to
|
|
<code class="display"><span class="extract">lexer_hwm-1</span></code>.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
|
|
<span class="reserved">void</span><span class="plain"> </span><span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">c</span><span class="plain">) {</span>
|
|
<span class="functiontext">Lexer::ensure_lexer_hwm_can_be_raised_by</span><span class="plain">(</span><span class="constant">MAX_WORD_LENGTH</span><span class="plain">, </span><span class="identifier">TRUE</span><span class="plain">);</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_literal_mode</span><span class="plain">) {</span>
|
|
<<span class="cwebmacro">Contemplate leaving literal mode</span> <span class="cwebmacronumber">26.7</span>><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">STRING_KW</span><span class="plain">) {</span>
|
|
<<span class="cwebmacro">Force string division at the start of a text substitution, if necessary</span> <span class="cwebmacronumber">26.8</span>><span class="plain">;</span>
|
|
<<span class="cwebmacro">Soak up whitespace around line breaks inside a literal string</span> <span class="cwebmacronumber">26.4</span>><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="comment">whitespace outside literal mode ends any partly built word and need not be recorded</span>
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lxs_literal_mode</span><span class="plain"> == </span><span class="identifier">FALSE</span><span class="plain">) && (</span><span class="identifier">is_whitespace</span><span class="plain">(</span><span class="identifier">c</span><span class="plain">))) {</span>
|
|
<<span class="cwebmacro">Admire the texture of the whitespace</span> <span class="cwebmacronumber">26.1</span>><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_word</span><span class="plain"> != </span><span class="identifier">lexer_hwm</span><span class="plain">) </span><<span class="cwebmacro">Complete the current word</span> <span class="cwebmacronumber">26.5</span>><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) </span><<span class="cwebmacro">Line break outside a literal</span> <span class="cwebmacronumber">26.3</span>><span class="plain">;</span>
|
|
<span class="reserved">return</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="comment">otherwise record the current character as part of the word being built</span>
|
|
<span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">++) = </span><span class="identifier">c</span><span class="plain">;</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_scanning_text_substitution</span><span class="plain">) {</span>
|
|
<<span class="cwebmacro">Force string division at the end of a text substitution, if necessary</span> <span class="cwebmacronumber">26.9</span>><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_this_word_is_empty_so_far</span><span class="plain">) {</span>
|
|
<<span class="cwebmacro">Look at recent whitespace to see what break it followed</span> <span class="cwebmacronumber">26.2</span>><span class="plain">;</span>
|
|
<<span class="cwebmacro">Contemplate entering literal mode</span> <span class="cwebmacronumber">26.6</span>><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="identifier">lxs_this_word_is_empty_so_far</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="identifier">lxs_this_line_is_empty_so_far</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">The function Lexer::feed_char_into_lexer is used in <a href="#SP24_1">§24.1</a>, <a href="#SP25">§25</a>, <a href="#SP26_3">§26.3</a>, <a href="#SP26_8">§26.8</a>, <a href="#SP26_9">§26.9</a>.</p>
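<p class="inwebparagraph">The marble mechanism can be sketched in miniature. What follows is a simplified model, not the lexer itself: it has no literal mode, uses plain <code class="display"><span class="extract">char</span></code> text, and works in a fixed-size workspace. Every name beginning <code class="display"><span class="extract">toy_</span></code>, and both size constants, are invented for the illustration. It shows only the central idea that characters accumulate between a word start and a rising high water mark, and that whitespace completes the word.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_WORKSPACE_SIZE 256   /* hypothetical; the real lexer extends its workspace on demand */
#define TOY_MAX_WORDS 32         /* hypothetical limit on the number of words */

typedef struct {
    char workspace[TOY_WORKSPACE_SIZE];
    char *word;                        /* start of the word being built, cf. lexer_word */
    char *hwm;                         /* high water mark, cf. lexer_hwm */
    const char *words[TOY_MAX_WORDS];  /* completed words, in order */
    int wordcount;
} toy_lexer;

void toy_reset(toy_lexer *lx) {
    lx->word = lx->hwm = lx->workspace;
    lx->wordcount = 0;
}

static int toy_is_whitespace(int c) {
    return (c == ' ') || (c == '\t') || (c == '\n');
}

/* The mechanism: whitespace is pulverised, anything else drops into the bucket. */
void toy_feed_char(toy_lexer *lx, int c) {
    /* room for one more character plus a terminator */
    assert(lx->hwm + 2 <= lx->workspace + TOY_WORKSPACE_SIZE);
    if (toy_is_whitespace(c)) {
        if (lx->word != lx->hwm) {           /* a partly built word exists: complete it */
            assert(lx->wordcount < TOY_MAX_WORDS);
            *(lx->hwm++) = 0;                /* terminate it as a C string */
            lx->words[lx->wordcount++] = lx->word;
            lx->word = lx->hwm;              /* the next word starts here */
        }
        return;
    }
    *(lx->hwm++) = (char) c;                 /* record the character */
}

void toy_feed_text(toy_lexer *lx, const char *text) {
    for (size_t i = 0; text[i]; i++) toy_feed_char(lx, text[i]);
}
```

<p class="inwebparagraph">Feeding the text <code class="display"><span class="extract">"the  cat\n\tsat "</span></code> through <code class="display"><span class="extract">toy_feed_text</span></code> leaves three words in the workspace, in their original order, however much whitespace separated them.</p>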
|
|
|
|
<p class="inwebparagraph"><a id="SP26_1"></a><b>§26.1. Dealing with whitespace. </b>Let's deal with the different textures of whitespace first, as these are
|
|
surprisingly rich all by themselves.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">The following keeps track of the biggest whitespace character it has seen
|
|
of late, ranking newlines bigger than tabs, which are in turn bigger than
|
|
spaces; and it counts up the number of tabs it has seen (cancelling
|
|
back to none if a newline is found).
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Admire the texture of the whitespace</span> <span class="cwebmacronumber">26.1</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">t</span><span class="character">'</span><span class="plain">) {</span>
|
|
<span class="identifier">lxs_number_of_tab_stops</span><span class="plain">++;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_most_significant_space_char</span><span class="plain"> != </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) </span><span class="identifier">lxs_most_significant_space_char</span><span class="plain"> = </span><span class="character">'\</span><span class="plain">t</span><span class="character">'</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) {</span>
|
|
<span class="identifier">lxs_number_of_tab_stops</span><span class="plain"> = 0;</span>
|
|
<span class="identifier">lxs_most_significant_space_char</span><span class="plain"> = </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
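<p class="inwebparagraph">As a standalone illustration, the same bookkeeping can be packaged into a small structure. This is a sketch only: the type <code class="display"><span class="extract">space_texture</span></code> and the two functions are invented names, whereas the lexer itself keeps these values in the variables <code class="display"><span class="extract">lxs_most_significant_space_char</span></code> and <code class="display"><span class="extract">lxs_number_of_tab_stops</span></code>.</p>

```c
#include <assert.h>

typedef struct {
    int most_significant_space_char;  /* ' ', '\t' or '\n' */
    int number_of_tab_stops;          /* tabs seen since the last newline */
} space_texture;

void texture_reset(space_texture *t) {
    t->most_significant_space_char = ' ';
    t->number_of_tab_stops = 0;
}

/* Observe one whitespace character: newline outranks tab, which outranks space. */
void texture_observe(space_texture *t, int c) {
    assert((c == ' ') || (c == '\t') || (c == '\n'));
    if (c == '\t') {
        t->number_of_tab_stops++;
        if (t->most_significant_space_char != '\n')
            t->most_significant_space_char = '\t';
    }
    if (c == '\n') {
        t->number_of_tab_stops = 0;   /* tabs before a newline no longer count */
        t->most_significant_space_char = '\n';
    }
}
```

<p class="inwebparagraph">After observing a space and two tabs the texture is a tab with two tab stops; a newline then resets the count to zero, and a further tab leaves the newline as the most significant character with one tab stop after it.</p>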
|
|
|
|
<p class="inwebparagraph"><a id="SP26_2"></a><b>§26.2. </b>To recall: we need to know what kind of whitespace prefaces each word
|
|
the lexer records.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">When we record the first character of a new word, it cannot be whitespace,
|
|
but it probably follows a sequence of one or more whitespace characters,
|
|
and the code in the previous paragraph has been watching them for us.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Look at recent whitespace to see what break it followed</span> <span class="cwebmacronumber">26.2</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lxs_most_significant_space_char</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) && (</span><span class="identifier">lxs_number_of_tab_stops</span><span class="plain"> >= 1))</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_break</span><span class="plain"> =</span>
|
|
<span class="functiontext">Lexer::break_char_for_indents</span><span class="plain">(</span><span class="identifier">lxs_number_of_tab_stops</span><span class="plain">); </span> <span class="comment">newline followed by 1 or more tabs</span>
|
|
<span class="reserved">else</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_break</span><span class="plain"> = </span><span class="identifier">lxs_most_significant_space_char</span><span class="plain">;</span>
|
|
|
|
<span class="identifier">lxs_most_significant_space_char</span><span class="plain"> = </span><span class="character">' '</span><span class="plain">; </span> <span class="comment">waiting for the next run of whitespace, after this word</span>
|
|
<span class="identifier">lxs_number_of_tab_stops</span><span class="plain"> = 0;</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
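<p class="inwebparagraph">The decision made above can be sketched as a function which takes the recorded texture and returns the break for the new word. The encoding returned by <code class="display"><span class="extract">Lexer::break_char_for_indents</span></code> is not shown in this section, so the version here, <code class="display"><span class="extract">toy_break_char_for_indents</span></code>, uses an invented encoding (a base value plus the tab count) purely so that the result can be told apart from an ordinary character. The structure and function names are likewise assumptions.</p>

```c
#include <assert.h>

#define TOY_INDENT_BREAK_BASE 0x100   /* hypothetical encoding, not the lexer's own */

typedef struct {
    int most_significant_space_char;  /* ' ', '\t' or '\n' */
    int number_of_tab_stops;
} space_texture;

/* Stand-in for Lexer::break_char_for_indents, whose real encoding is not shown. */
int toy_break_char_for_indents(int tabs) {
    assert(tabs >= 1);
    return TOY_INDENT_BREAK_BASE + tabs;
}

/* Work out the break for a word which is just starting, then reset the
   texture so that it is ready for the run of whitespace after this word. */
int toy_break_for_new_word(space_texture *t) {
    int brk;
    if ((t->most_significant_space_char == '\n') && (t->number_of_tab_stops >= 1))
        brk = toy_break_char_for_indents(t->number_of_tab_stops);  /* newline then tabs */
    else
        brk = t->most_significant_space_char;
    t->most_significant_space_char = ' ';
    t->number_of_tab_stops = 0;
    return brk;
}
```

<p class="inwebparagraph">Note that tabs count as indentation only when they follow a newline: three tabs in the middle of a line give the break <code class="display"><span class="extract">'\t'</span></code>, not an indentation code.</p>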
|
|
|
|
<p class="inwebparagraph"><a id="SP26_3"></a><b>§26.3. </b>Line breaks are usually like any other whitespace, if we are outside
|
|
literal mode, but we want to keep an eye out for paragraph breaks, because
|
|
these are sometimes semantically meaningful in NI and so cannot be
|
|
discarded. A paragraph break is converted into a special "divider" word.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Line break outside a literal</span> <span class="cwebmacronumber">26.3</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_this_line_is_empty_so_far</span><span class="plain">) {</span>
|
|
<span class="reserved">for</span><span class="plain"> (</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">i</span><span class="plain">=0; </span><span class="constant">PARAGRAPH_BREAK</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]; </span><span class="identifier">i</span><span class="plain">++)</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="constant">PARAGRAPH_BREAK</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]);</span>
|
|
<span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
<span class="identifier">lxs_this_line_is_empty_so_far</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">;</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
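<p class="inwebparagraph">To see the divider word appear, here is a cut-down model of the same recursion. It is a sketch under several assumptions: the divider text <code class="display"><span class="extract">"&lt;para&gt;"</span></code> is a placeholder, since the value of <code class="display"><span class="extract">PARAGRAPH_BREAK</span></code> is not given in this section; completed words are simply appended to an output string, each followed by a slash; and the model begins with the "line is empty" flag clear, which may not match the lexer's own starting state. All names beginning <code class="display"><span class="extract">toy_</span></code> are invented.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PARAGRAPH_BREAK "<para>"  /* hypothetical stand-in for PARAGRAPH_BREAK */

typedef struct {
    char out[256];    /* completed words, each followed by '/' */
    size_t used;
    int word_open;    /* are we part-way through a word? */
    int line_empty;   /* cf. lxs_this_line_is_empty_so_far */
} toy_divider_lexer;

void toy_divider_reset(toy_divider_lexer *lx) {
    lx->used = 0; lx->out[0] = 0;
    lx->word_open = 0;
    lx->line_empty = 0;   /* assumption: see the note above */
}

void toy_divider_feed(toy_divider_lexer *lx, int c) {
    assert(lx->used + 2 < sizeof lx->out);
    if ((c == ' ') || (c == '\t') || (c == '\n')) {
        if (lx->word_open) { lx->out[lx->used++] = '/'; lx->word_open = 0; }
        if (c == '\n') {
            if (lx->line_empty) {
                /* a second line break with nothing in between: feed in the divider word */
                for (size_t i = 0; TOY_PARAGRAPH_BREAK[i]; i++)
                    toy_divider_feed(lx, TOY_PARAGRAPH_BREAK[i]);
                toy_divider_feed(lx, ' ');
            }
            lx->line_empty = 1;
        }
        lx->out[lx->used] = 0;
        return;
    }
    lx->out[lx->used++] = (char) c;
    lx->out[lx->used] = 0;
    lx->word_open = 1;
    lx->line_empty = 0;
}

void toy_divider_feed_text(toy_divider_lexer *lx, const char *text) {
    for (size_t i = 0; text[i]; i++) toy_divider_feed(lx, text[i]);
}
```

<p class="inwebparagraph">Feeding <code class="display"><span class="extract">"a b\n\nc\n"</span></code> produces the words a, b, the divider, and c, whereas a single line break, as in <code class="display"><span class="extract">"x\ny "</span></code>, produces no divider. Feeding the divider's characters marks the line as non-empty, which is why the flag is set again only after the loop has finished.</p>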
|
|
|
|
<p class="inwebparagraph"><a id="SP26_4"></a><b>§26.4. </b>When working through a literal string, a newline together with any
preceding whitespace is converted into a single space character, and we
enter "soak up spaces" mode, in which any subsequent whitespace is
ignored until something else is reached. If we reach another newline while
still soaking up, then the literal text contained a paragraph break. In
that case, the run of whitespace is converted not to a single
space <code class="display"><span class="extract">" "</span></code> but to two forced newlines in quick succession. In other words,
paragraph breaks in literal strings are converted to codes which will make
Inform print a paragraph break at run-time.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Soak up whitespace around line breaks inside a literal string</span> <span class="cwebmacronumber">26.4</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_string_soak_up_spaces_mode</span><span class="plain">) {</span>
|
|
<span class="reserved">switch</span><span class="plain">(</span><span class="identifier">c</span><span class="plain">) {</span>
|
|
<span class="reserved">case</span><span class="plain"> </span><span class="character">' '</span><span class="plain">: </span><span class="reserved">case</span><span class="plain"> </span><span class="character">'\</span><span class="plain">t</span><span class="character">'</span><span class="plain">: </span><span class="identifier">c</span><span class="plain"> = *(</span><span class="identifier">lexer_hwm</span><span class="plain">-1); </span><span class="identifier">lexer_hwm</span><span class="plain">--; </span><span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="reserved">case</span><span class="plain"> </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">:</span>
|
|
<span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">-1) = </span><span class="identifier">NEWLINE_IN_STRING</span><span class="plain">;</span>
|
|
<span class="identifier">c</span><span class="plain"> = </span><span class="identifier">NEWLINE_IN_STRING</span><span class="plain">;</span>
|
|
<span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="reserved">default</span><span class="plain">: </span><span class="identifier">lxs_string_soak_up_spaces_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">; </span><span class="reserved">break</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="character">'\</span><span class="plain">n</span><span class="character">'</span><span class="plain">) {</span>
|
|
<span class="reserved">while</span><span class="plain"> (</span><span class="identifier">is_whitespace</span><span class="plain">(*(</span><span class="identifier">lexer_hwm</span><span class="plain">-1))) </span><span class="identifier">lexer_hwm</span><span class="plain">--;</span>
|
|
<span class="identifier">lxs_string_soak_up_spaces_mode</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
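<p class="inwebparagraph">The effect described in the prose can be imitated by a self-contained function working from one buffer to another. This sketch follows the description rather than the lexer's in-place manipulation of the high water mark, and simplifies in two ways which should be treated as assumptions: the forced newline is represented by the placeholder character <code class="display"><span class="extract">'#'</span></code>, since the value of <code class="display"><span class="extract">NEWLINE_IN_STRING</span></code> is not given here, and a run containing three or more line breaks still yields exactly two markers. The name <code class="display"><span class="extract">toy_soak</span></code> is invented.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FORCED_NEWLINE '#'   /* hypothetical stand-in for NEWLINE_IN_STRING */

/* Copy the body of a literal string from in to out, soaking up whitespace
   around line breaks. out must have room for strlen(in)+1 characters. */
void toy_soak(const char *in, char *out) {
    assert((in != NULL) && (out != NULL));
    size_t n = 0;
    int soaking = 0;
    for (; *in; in++) {
        char c = *in;
        if (soaking) {
            if ((c == ' ') || (c == '\t')) continue;   /* ignored while soaking */
            if (c == '\n') {
                /* a second line break: this was a paragraph break */
                if ((n > 0) && (out[n-1] == ' ')) {
                    out[n-1] = TOY_FORCED_NEWLINE;
                    out[n++] = TOY_FORCED_NEWLINE;
                }
                continue;
            }
            soaking = 0;                               /* something else was reached */
        }
        if (c == '\n') {
            /* discard whitespace before the line break, then leave a single space */
            while ((n > 0) && ((out[n-1] == ' ') || (out[n-1] == '\t'))) n--;
            out[n++] = ' ';
            soaking = 1;
            continue;
        }
        out[n++] = c;
    }
    out[n] = 0;
}
```

<p class="inwebparagraph">So <code class="display"><span class="extract">"one  \n   two"</span></code> becomes <code class="display"><span class="extract">"one two"</span></code>, while <code class="display"><span class="extract">"one\n \n two"</span></code>, which contains a blank line, becomes <code class="display"><span class="extract">"one##two"</span></code>.</p>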
|
|
|
|
<p class="inwebparagraph"><a id="SP26_5"></a><b>§26.5. Completing a word. </b>Outside of whitespace, then, our word (whatever it was — ordinary word,
|
|
literal string, I6 insertion or comment) has been stored character by
|
|
character at the steadily rising high water mark. We have now hit the end
|
|
by reaching whitespace (in the case of a literal, this has happened because
|
|
we found the end of the literal, escaped literal mode, and then hit
|
|
whitespace). The start of the word is at <code class="display"><span class="extract">lexer_word</span></code>; the last character
|
|
is stored just below <code class="display"><span class="extract">lexer_hwm</span></code>.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Complete the current word</span> <span class="cwebmacronumber">26.5</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="plain">*</span><span class="identifier">lexer_hwm</span><span class="plain">++ = 0; </span> <span class="comment">terminate the current word as a C string</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lexer_wait_for_dashes</span><span class="plain">) && (</span><span class="identifier">Wide::cmp</span><span class="plain">(</span><span class="identifier">lexer_word</span><span class="plain">, </span><span class="identifier">L</span><span class="string">"----"</span><span class="plain">) == 0))</span>
|
|
<span class="identifier">lexer_wait_for_dashes</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">; </span> <span class="comment">our long wait for documentation is over</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lexer_wait_for_dashes</span><span class="plain"> == </span><span class="identifier">FALSE</span><span class="plain">) && (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> != </span><span class="constant">COMMENT_KW</span><span class="plain">)) {</span>
|
|
<<span class="cwebmacro">Issue problem message and truncate if over maximum length for what it is</span> <span class="cwebmacronumber">26.5.1</span>><span class="plain">;</span>
|
|
<<span class="cwebmacro">Store everything about the word except its break, which we already know</span> <span class="cwebmacronumber">26.5.2</span>><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="comment">now get ready for what we expect by default to be an ordinary word next</span>
|
|
<span class="identifier">lexer_word</span><span class="plain"> = </span><span class="identifier">lexer_hwm</span><span class="plain">;</span>
|
|
<span class="identifier">lxs_this_word_is_empty_so_far</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">;</span>
|
|
<span class="identifier">lxs_kind_of_word</span><span class="plain"> = </span><span class="constant">ORDINARY_KW</span><span class="plain">;</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP26_5_1"></a><b>§26.5.1. </b>Note that here we are recording either an ordinary word, a literal string
|
|
or a literal I6 insertion: comments are also literal, but are thrown away,
|
|
and do not come here.
|
|
</p>
|
|
|
|
|
|
<pre class="definitions">
|
|
<span class="definitionkeyword">define</span> <span class="constant">MAX_STRING_LENGTH</span><span class="plain"> 8*1024</span>
|
|
<span class="definitionkeyword">enum</span> <span class="constant">STRING_TOO_LONG_LEXERERROR</span>
|
|
<span class="definitionkeyword">enum</span> <span class="constant">WORD_TOO_LONG_LEXERERROR</span>
|
|
<span class="definitionkeyword">enum</span> <span class="constant">I6_TOO_LONG_LEXERERROR</span>
|
|
</pre>
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Issue problem message and truncate if over maximum length for what it is</span> <span class="cwebmacronumber">26.5.1</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">len</span><span class="plain"> = </span><span class="identifier">Wide::len</span><span class="plain">(</span><span class="identifier">lexer_word</span><span class="plain">), </span><span class="identifier">max_len</span><span class="plain"> = </span><span class="constant">MAX_WORD_LENGTH</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">STRING_KW</span><span class="plain">) </span><span class="identifier">max_len</span><span class="plain"> = </span><span class="constant">MAX_STRING_LENGTH</span><span class="plain">;</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">I6_INCLUSION_KW</span><span class="plain">) </span><span class="identifier">max_len</span><span class="plain"> = </span><span class="constant">MAX_VERBATIM_LENGTH</span><span class="plain">;</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">len</span><span class="plain"> > </span><span class="identifier">max_len</span><span class="plain">) {</span>
|
|
<span class="identifier">lexer_word</span><span class="plain">[</span><span class="identifier">max_len</span><span class="plain">] = 0; </span> <span class="comment">truncate to its maximum length</span>
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">STRING_KW</span><span class="plain">) {</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">STRING_TOO_LONG_LEXERERROR</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">, </span><span class="identifier">lexer_word</span><span class="plain">);</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> </span><span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">I6_INCLUSION_KW</span><span class="plain">) {</span>
|
|
<span class="identifier">lexer_word</span><span class="plain">[100] = 0; </span> <span class="comment">to avoid an absurdly long problem message</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">I6_TOO_LONG_LEXERERROR</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">, </span><span class="identifier">lexer_word</span><span class="plain">);</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> {</span>
|
|
<span class="identifier">LEXER_PROBLEM_HANDLER</span><span class="plain">(</span><span class="constant">WORD_TOO_LONG_LEXERERROR</span><span class="plain">, </span><span class="identifier">NULL</span><span class="plain">, </span><span class="identifier">lexer_word</span><span class="plain">);</span>
|
|
<span class="plain">}</span>
|
|
<span class="plain">}</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26_5">§26.5</a>.</p>
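<p class="inwebparagraph">The shape of this check, choosing a limit by kind of word and then truncating in place, can be shown in isolation. In the sketch below the limits are deliberately tiny, invented values so that the behaviour is easy to observe, and the function returns an error code where the lexer would call <code class="display"><span class="extract">LEXER_PROBLEM_HANDLER</span></code>. All names beginning <code class="display"><span class="extract">toy_</span></code> or <code class="display"><span class="extract">TOY_</span></code> are assumptions, and the standard <code class="display"><span class="extract">wcslen</span></code> stands in for <code class="display"><span class="extract">Wide::len</span></code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef enum { TOY_ORDINARY_KW, TOY_STRING_KW, TOY_I6_INCLUSION_KW } toy_kind;
typedef enum { TOY_NO_ERROR, TOY_WORD_TOO_LONG, TOY_STRING_TOO_LONG, TOY_I6_TOO_LONG } toy_error;

/* Hypothetical limits, far smaller than any real ones. */
#define TOY_MAX_WORD_LENGTH 8
#define TOY_MAX_STRING_LENGTH 16
#define TOY_MAX_VERBATIM_LENGTH 32

/* Truncate word in place if it is too long for its kind, and say what went wrong. */
toy_error toy_check_length(wchar_t *word, toy_kind kind) {
    assert(word != NULL);
    size_t len = wcslen(word), max_len = TOY_MAX_WORD_LENGTH;
    if (kind == TOY_STRING_KW) max_len = TOY_MAX_STRING_LENGTH;
    if (kind == TOY_I6_INCLUSION_KW) max_len = TOY_MAX_VERBATIM_LENGTH;

    if (len <= max_len) return TOY_NO_ERROR;
    word[max_len] = 0;   /* truncate to its maximum length */
    if (kind == TOY_STRING_KW) return TOY_STRING_TOO_LONG;
    if (kind == TOY_I6_INCLUSION_KW) return TOY_I6_TOO_LONG;
    return TOY_WORD_TOO_LONG;
}
```

<p class="inwebparagraph">With these limits, the fifteen-character word "extraordinarily" is cut down to "extraord" if it is an ordinary word, but passes unchanged if it is the text of a string.</p>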
|
|
|
|
<p class="inwebparagraph"><a id="SP26_5_2"></a><b>§26.5.2. </b>We recorded the break for the word when it started (recall that, even if
|
|
the current word is a literal, its first character was read outside literal
|
|
mode, so it started out in life as an ordinary word and therefore had its
|
|
break recorded). So now we need to set everything else about it, and to
|
|
increment the word-count. We must not allow this to reach its maximum,
since the next word's break setting would then be written past the end of
the array.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">For ordinary words (but not literals), the copy of a word in the main array
|
|
<code class="display"><span class="extract">lw_text</span></code> is converted to lower case. The original is preserved in <code class="display"><span class="extract">lw_rawtext</span></code> and
|
|
is used to print more attractive error messages, and also to enable a few
|
|
semantic parts of NI to be case sensitive. This copying means that in the
|
|
worst case — when we complete an ordinary word of maximal length — we need
|
|
to consume an additional <code class="display"><span class="extract">MAX_WORD_LENGTH+2</span></code> bytes of the lexer's workspace,
|
|
which is why that was the amount we checked to ensure existed when the
|
|
lexer was called. The lowering loop can therefore never overspill the
|
|
workspace.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
|
|
<<span class="cwebmacrodefn">Store everything about the word except its break, which we already know</span> <span class="cwebmacronumber">26.5.2</span>> =
|
|
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain"> = </span><span class="identifier">lexer_word</span><span class="plain">;</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_source</span><span class="plain"> = </span><span class="identifier">lexer_position</span><span class="plain">;</span>
|
|
|
|
<span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_kind_of_word</span><span class="plain"> == </span><span class="constant">ORDINARY_KW</span><span class="plain">) {</span>
|
|
<span class="reserved">int</span><span class="plain"> </span><span class="identifier">i</span><span class="plain">;</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain"> = </span><span class="identifier">lexer_hwm</span><span class="plain">;</span>
|
|
<span class="reserved">for</span><span class="plain"> (</span><span class="identifier">i</span><span class="plain">=0; </span><span class="identifier">lexer_word</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]; </span><span class="identifier">i</span><span class="plain">++) *(</span><span class="identifier">lexer_hwm</span><span class="plain">++) = </span><span class="identifier">Characters::tolower</span><span class="plain">(</span><span class="identifier">lexer_word</span><span class="plain">[</span><span class="identifier">i</span><span class="plain">]);</span>
|
|
<span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">++) = 0;</span>
|
|
<span class="plain">} </span><span class="reserved">else</span><span class="plain"> {</span>
|
|
<span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_text</span><span class="plain"> = </span><span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">]</span><span class="element">.lw_rawtext</span><span class="plain">;</span>
|
|
<span class="plain">}</span>
|
|
|
|
<span class="functiontext">Vocabulary::identify_word</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain">); </span> <span class="comment">which sets <code class="display"><span class="extract">lw_array[lexer_wordcount].lw_identity</span></code></span>
|
|
|
|
<span class="identifier">lexer_wordcount</span><span class="plain">++;</span>
|
|
<span class="functiontext">Lexer::ensure_space_up_to</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain">);</span>
|
|
</pre>
|
|
|
|
<p class="inwebparagraph"></p>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26_5">§26.5</a>.</p>
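<p class="inwebparagraph">The arrangement of the raw and lowered copies in a single workspace can be sketched as follows. This is an illustration under assumptions: <code class="display"><span class="extract">toy_word</span></code> and <code class="display"><span class="extract">toy_store_word</span></code> are invented names, the workspace is passed in explicitly rather than being global, and the standard <code class="display"><span class="extract">towlower</span></code> stands in for <code class="display"><span class="extract">Characters::tolower</span></code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

typedef struct {
    const wchar_t *rawtext;  /* the word as typed, cf. lw_rawtext */
    const wchar_t *text;     /* the word as matched, cf. lw_text */
} toy_word;

/* word is a terminated word already in the workspace; hwm points just past
   its terminator; end is one past the workspace. Returns the new hwm. */
wchar_t *toy_store_word(toy_word *w, wchar_t *word, wchar_t *hwm, wchar_t *end, int ordinary) {
    w->rawtext = word;
    if (ordinary) {
        /* the lowered copy needs room for the word again, plus a terminator */
        assert(hwm + wcslen(word) + 1 <= end);
        w->text = hwm;
        for (size_t i = 0; word[i]; i++)
            *(hwm++) = (wchar_t) towlower((wint_t) word[i]);
        *(hwm++) = 0;
    } else {
        w->text = w->rawtext;   /* literals are not lowered, so one copy serves for both */
    }
    return hwm;
}
```

<p class="inwebparagraph">Storing the ordinary word "Kitchen" leaves "Kitchen" as the raw text and places "kitchen" immediately after it, raising the high water mark by eight characters; storing it as a literal leaves the mark where it was.</p>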
|
|
|
|
<p class="inwebparagraph"><a id="SP26_6"></a><b>§26.6. Entering and leaving literal mode. </b>After a character has been stored, in ordinary mode, we see if it
|
|
provokes us into entering literal mode, by signifying the start of a
|
|
comment, string or passage of verbatim Inform 6.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">In the case of a string, we positively want to keep the opening character
|
|
just recorded as part of the word: it's the opening double-quote mark.
|
|
In the case of a comment, we don't care, as we're going to throw it away
|
|
anyhow; as it happens, we keep it for now. But in the case of an I6
|
|
escape we are in danger, because of the auto-spacing around brackets, of
|
|
recording two words
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p><code class="display"><span class="extract">( -something</span></code></p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">when in fact we want to record
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p><code class="display"><span class="extract">(- something</span></code></p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">We do this by adding a hyphen to the previous word (the <code class="display"><span class="extract">(</span></code> word), and by
|
|
throwing away the hyphen from the material of the current word.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
&lt;<span class="cwebmacrodefn">Contemplate entering literal mode</span> <span class="cwebmacronumber">26.6</span>&gt; =
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
    <span class="reserved">switch</span><span class="plain">(</span><span class="identifier">c</span><span class="plain">) {</span>
        <span class="reserved">case</span><span class="plain"> </span><span class="constant">COMMENT_BEGIN</span><span class="plain">:</span>
            <span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">; </span><span class="identifier">lxs_kind_of_word</span><span class="plain"> = </span><span class="constant">COMMENT_KW</span><span class="plain">;</span>
            <span class="identifier">lxs_comment_nesting</span><span class="plain"> = 1;</span>
            <span class="reserved">break</span><span class="plain">;</span>
        <span class="reserved">case</span><span class="plain"> </span><span class="constant">STRING_BEGIN</span><span class="plain">:</span>
            <span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">; </span><span class="identifier">lxs_kind_of_word</span><span class="plain"> = </span><span class="constant">STRING_KW</span><span class="plain">;</span>
            <span class="reserved">break</span><span class="plain">;</span>
        <span class="reserved">case</span><span class="plain"> </span><span class="constant">INFORM6_ESCAPE_BEGIN_2</span><span class="plain">:</span>
            <span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lxs_previous_char_in_raw_feed</span><span class="plain"> != </span><span class="constant">INFORM6_ESCAPE_BEGIN_1</span><span class="plain">) ||</span>
                <span class="plain">(</span><span class="identifier">lexer_allow_I6_escapes</span><span class="plain"> == </span><span class="identifier">FALSE</span><span class="plain">)) </span><span class="reserved">break</span><span class="plain">;</span>
            <span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">; </span><span class="identifier">lxs_kind_of_word</span><span class="plain"> = </span><span class="constant">I6_INCLUSION_KW</span><span class="plain">;</span>
            <span class="comment">because of spacing around punctuation outside literal mode, the <code class="display"><span class="extract">(</span></code> became a word</span>
            <span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lexer_wordcount</span><span class="plain"> &gt; 0) { </span> <span class="comment">this should always be true: just being cautious</span>
                <span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">-1]</span><span class="element">.lw_text</span><span class="plain"> = </span><span class="identifier">L</span><span class="string">"(-"</span><span class="plain">; </span> <span class="comment">change the previous word's text from <code class="display"><span class="extract">(</span></code> to <code class="display"><span class="extract">(-</span></code></span>
                <span class="identifier">lw_array</span><span class="plain">[</span><span class="identifier">lexer_wordcount</span><span class="plain">-1]</span><span class="element">.lw_rawtext</span><span class="plain"> = </span><span class="identifier">L</span><span class="string">"(-"</span><span class="plain">;</span>
                <span class="functiontext">Vocabulary::identify_word</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain">-1); </span> <span class="comment">and re-identify</span>
            <span class="plain">}</span>
            <span class="identifier">lexer_hwm</span><span class="plain">--; </span> <span class="comment">erase the just-recorded <code class="display"><span class="extract">INFORM6_ESCAPE_BEGIN_2</span></code> character</span>
            <span class="reserved">break</span><span class="plain">;</span>
    <span class="plain">}</span>
</pre>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
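<p class="inwebparagraph">The fix-up at the opening of an I6 escape can be tried out in isolation. What follows is a minimal sketch in plain C, not the lexer's own code: the names <code class="display"><span class="extract">word_list</span></code>, <code class="display"><span class="extract">complete_word</span></code> and <code class="display"><span class="extract">split_words</span></code> are hypothetical, only the open bracket is auto-spaced, and literal mode itself is left out. It shows a hyphen which immediately follows an open bracket being folded into the previous word rather than stored in the current one.</p>

```c
#include <assert.h>
#include <string.h>

#define MAX_WORDS 8     /* illustrative limits, not the lexer's */
#define MAX_LENGTH 32

typedef struct {
    char text[MAX_WORDS][MAX_LENGTH];
    int count;
} word_list;

/* Close the word under construction, if it holds any characters. */
static void complete_word(word_list *wl, int *length) {
    if (*length > 0) {
        wl->text[wl->count][*length] = 0;
        wl->count++;
    }
    *length = 0;
}

/* Break source into words at spaces, treating '(' as a word of its own. */
void split_words(const char *source, word_list *wl) {
    assert((source != NULL) && (wl != NULL));
    int length = 0;
    char previous = 0;   /* the previous character in the raw feed */
    wl->count = 0;
    for (int i = 0; (source[i] != 0) && (wl->count < MAX_WORDS); i++) {
        char c = source[i];
        if (c == ' ') {
            complete_word(wl, &length);
        } else if (c == '(') {
            complete_word(wl, &length);
            if (wl->count < MAX_WORDS) strcpy(wl->text[wl->count++], "(");
        } else if ((c == '-') && (previous == '(') && (wl->count > 0)) {
            /* change the previous word from ( to (- and do not store the hyphen */
            strcpy(wl->text[wl->count - 1], "(-");
        } else if (length < MAX_LENGTH - 1) {
            wl->text[wl->count][length++] = c;
        }
        previous = c;
    }
    if (wl->count < MAX_WORDS) complete_word(wl, &length);
}
```

<p class="inwebparagraph">Under these assumptions, <code class="display"><span class="extract">Include (-something</span></code> yields the three words <code class="display"><span class="extract">Include</span></code>, <code class="display"><span class="extract">(-</span></code> and <code class="display"><span class="extract">something</span></code>, whereas <code class="display"><span class="extract">a ( -b</span></code> yields <code class="display"><span class="extract">a</span></code>, <code class="display"><span class="extract">(</span></code> and <code class="display"><span class="extract">-b</span></code>, because there the hyphen does not directly follow the bracket.</p>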
|
|
|
|
<p class="inwebparagraph"><a id="SP26_7"></a><b>§26.7. </b>So literal mode is used for comments, strings and verbatim passages of
Inform 6 code. We are in this mode only while scanning the middle of
the literal: after all, we scanned (and recorded) the start of the literal
in ordinary mode, before noticing that the character(s) marked the onset of
a literal.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">Note that, when we leave literal mode, we set the current character to a
|
|
space. This means the character forcing our departure is lost and not
|
|
recorded: but we only actually want it in the case of strings (because
|
|
we prefer to record them in the form <code class="display"><span class="extract">"frogs and lilies"</span></code> rather than
|
|
<code class="display"><span class="extract">"frogs and lilies</span></code>, for tidiness's sake). And so for strings we explicitly
|
|
record a close quotation mark.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">The new current character, being a space and thus whitespace outside of
|
|
literal mode, triggers the completion of the word, recording whatever
|
|
literal we have just made. (Or, if it was a comment, discarding it.)
|
|
<code class="display"><span class="extract">lxs_kind_of_word</span></code> continues to hold the kind of literal we have just
|
|
finished.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
&lt;<span class="cwebmacrodefn">Contemplate leaving literal mode</span> <span class="cwebmacronumber">26.7</span>&gt; =
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
    <span class="reserved">switch</span><span class="plain">(</span><span class="identifier">lxs_kind_of_word</span><span class="plain">) {</span>
        <span class="reserved">case</span><span class="plain"> </span><span class="constant">COMMENT_KW</span><span class="plain">:</span>
            <span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="constant">COMMENT_BEGIN</span><span class="plain">) </span><span class="identifier">lxs_comment_nesting</span><span class="plain">++;</span>
            <span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="constant">COMMENT_END</span><span class="plain">) {</span>
                <span class="identifier">lxs_comment_nesting</span><span class="plain">--;</span>
                <span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_comment_nesting</span><span class="plain"> == 0) </span><span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
            <span class="plain">}</span>
            <span class="reserved">break</span><span class="plain">;</span>
        <span class="reserved">case</span><span class="plain"> </span><span class="constant">STRING_KW</span><span class="plain">:</span>
            <span class="reserved">if</span><span class="plain"> (</span><span class="identifier">c</span><span class="plain"> == </span><span class="constant">STRING_END</span><span class="plain">) {</span>
                <span class="identifier">lxs_string_soak_up_spaces_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
                <span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">++) = </span><span class="identifier">c</span><span class="plain">; </span> <span class="comment">record the <code class="display"><span class="extract">STRING_END</span></code> character as part of the word</span>
                <span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
            <span class="plain">}</span>
            <span class="reserved">break</span><span class="plain">;</span>
        <span class="reserved">case</span><span class="plain"> </span><span class="constant">I6_INCLUSION_KW</span><span class="plain">:</span>
            <span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">c</span><span class="plain"> == </span><span class="constant">INFORM6_ESCAPE_END_2</span><span class="plain">) &amp;&amp;</span>
                <span class="plain">(</span><span class="identifier">lxs_previous_char_in_raw_feed</span><span class="plain"> == </span><span class="constant">INFORM6_ESCAPE_END_1</span><span class="plain">)) {</span>
                <span class="identifier">lexer_hwm</span><span class="plain">--; </span> <span class="comment">erase the <code class="display"><span class="extract">INFORM6_ESCAPE_END_1</span></code> character recorded last time</span>
                <span class="identifier">lxs_literal_mode</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
            <span class="plain">}</span>
            <span class="reserved">break</span><span class="plain">;</span>
        <span class="reserved">default</span><span class="plain">: </span><span class="identifier">internal_error</span><span class="plain">(</span><span class="string">"in unknown literal mode"</span><span class="plain">);</span>
    <span class="plain">}</span>
    <span class="reserved">if</span><span class="plain"> (</span><span class="identifier">lxs_literal_mode</span><span class="plain"> == </span><span class="identifier">FALSE</span><span class="plain">) </span><span class="identifier">c</span><span class="plain"> = </span><span class="character">' '</span><span class="plain">; </span> <span class="comment">trigger completion of this word</span>
</pre>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
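<p class="inwebparagraph">The nesting count kept for comments can be illustrated on its own. This is a sketch under assumptions rather than part of the lexer: the function name <code class="display"><span class="extract">comment_close_position</span></code> is hypothetical, and <code class="display"><span class="extract">COMMENT_BEGIN</span></code> and <code class="display"><span class="extract">COMMENT_END</span></code> are assumed here to be the square brackets of Inform 7 source text. The count starts at 1, as it does on entering literal mode, and the comment ends only when it falls back to 0.</p>

```c
#include <assert.h>

#define COMMENT_BEGIN '['   /* assumed values for this sketch */
#define COMMENT_END   ']'

/* Given text in which text[start] opens a comment, return the index of the
   COMMENT_END character which closes it, allowing for nested comments, or
   -1 if the text runs out first. */
int comment_close_position(const char *text, int start) {
    assert(text[start] == COMMENT_BEGIN);
    int nesting = 1;   /* the opening character has already been seen */
    for (int i = start + 1; text[i] != 0; i++) {
        if (text[i] == COMMENT_BEGIN) nesting++;
        if (text[i] == COMMENT_END) {
            nesting--;
            if (nesting == 0) return i;   /* the point of leaving literal mode */
        }
    }
    return -1;
}
```

<p class="inwebparagraph">So in <code class="display"><span class="extract">[a [b] c] d</span></code> the inner close square only reduces the count from 2 to 1, and it is the second close square which ends the comment.</p>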
|
|
|
|
<p class="inwebparagraph"><a id="SP26_8"></a><b>§26.8. Breaking strings up at text substitutions. </b>When text contains text substitutions, these are ordinarily ignored by the
|
|
lexer, but in <code class="display"><span class="extract">lexer_divide_strings_at_text_substitutions</span></code> mode, we need to
|
|
force strings to end and resume at the two ends of each substitution. For
|
|
instance:
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>"Hello, [greeted person]. Do you make it [supper time]?"</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">must be split as
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p><code class="display"><span class="extract">"Hello, " , greeted person , ". Do you make it " , supper time , "?"</span></code></p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">where our original single text literal is now three text literals, plus
|
|
eight ordinary words (four of them commas).
|
|
</p>
|
|
|
|
<p class="inwebparagraph">Note that each open square bracket, and each close square bracket, has been
removed and replaced by a comma word. We deal with open squares before the
character is recorded, so to get rid of the <code class="display"><span class="extract">[</span></code> character, we simply change
<code class="display"><span class="extract">c</span></code> to a space:
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
&lt;<span class="cwebmacrodefn">Force string division at the start of a text substitution, if necessary</span> <span class="cwebmacronumber">26.8</span>&gt; =
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
    <span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lexer_divide_strings_at_text_substitutions</span><span class="plain">) &amp;&amp; (</span><span class="identifier">c</span><span class="plain"> == </span><span class="constant">TEXT_SUBSTITUTION_BEGIN</span><span class="plain">)) {</span>
        <span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="constant">STRING_END</span><span class="plain">); </span> <span class="comment">feed <code class="display"><span class="extract">"</span></code> to close the old string</span>
        <span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">);</span>
        <span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="constant">TEXT_SUBSTITUTION_SEPARATOR</span><span class="plain">); </span> <span class="comment">feed <code class="display"><span class="extract">,</span></code> to start new word</span>
        <span class="identifier">c</span><span class="plain"> = </span><span class="character">' '</span><span class="plain">; </span> <span class="comment">the lexer now goes on to record a space, which will end the <code class="display"><span class="extract">,</span></code> word</span>
        <span class="identifier">lxs_scanning_text_substitution</span><span class="plain"> = </span><span class="identifier">TRUE</span><span class="plain">; </span> <span class="comment">but remember that we must get back again</span>
    <span class="plain">}</span>
</pre>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP26_9"></a><b>§26.9. </b>Close squares, by contrast, are dealt with after the character has been
recorded, so we have to erase it to get rid of the <code class="display"><span class="extract">]</span></code>. Note that since this was read in ordinary
mode, it was automatically spaced (being punctuation), and that therefore
the feeder above has just sent the second of a sequence of three characters:
space, <code class="display"><span class="extract">]</span></code>, space. That means we have recorded, so far, a one-character
word in ordinary mode, whose text consists only of <code class="display"><span class="extract">]</span></code>. By overwriting
this with a comma, we instead get a one-character word in ordinary mode
whose text consists only of a comma. We then feed a space to end that word,
and then feed a double-quote to start text again.
|
|
</p>
|
|
|
|
<p class="inwebparagraph">But, it might be objected: surely the feeder above is still poised with
the third character of its sequence (space, <code class="display"><span class="extract">]</span></code>, space), and that means
it will now feed a spurious space into the start of our resumed text?
Happily, the answer is no: this is why the feeder above checks that it
is still in ordinary mode before sending that third character. Having
opened quotes again, we have put the lexer back into literal mode: and so the
spurious space is never fed, and there is no problem.
|
|
</p>
|
|
|
|
|
|
<p class="macrodefinition"><code class="display">
&lt;<span class="cwebmacrodefn">Force string division at the end of a text substitution, if necessary</span> <span class="cwebmacronumber">26.9</span>&gt; =
</code></p>
|
|
|
|
|
|
<pre class="displaydefn">
    <span class="reserved">if</span><span class="plain"> ((</span><span class="identifier">lexer_divide_strings_at_text_substitutions</span><span class="plain">) &amp;&amp; (</span><span class="identifier">c</span><span class="plain"> == </span><span class="constant">TEXT_SUBSTITUTION_END</span><span class="plain">)) {</span>
        <span class="identifier">lxs_scanning_text_substitution</span><span class="plain"> = </span><span class="identifier">FALSE</span><span class="plain">;</span>
        <span class="plain">*(</span><span class="identifier">lexer_hwm</span><span class="plain">-1) = </span><span class="constant">TEXT_SUBSTITUTION_SEPARATOR</span><span class="plain">; </span> <span class="comment">overwrite recorded copy of <code class="display"><span class="extract">]</span></code> with <code class="display"><span class="extract">,</span></code></span>
        <span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="character">' '</span><span class="plain">); </span> <span class="comment">then feed a space to end the <code class="display"><span class="extract">,</span></code> word</span>
        <span class="functiontext">Lexer::feed_char_into_lexer</span><span class="plain">(</span><span class="constant">STRING_BEGIN</span><span class="plain">); </span> <span class="comment">then feed <code class="display"><span class="extract">"</span></code> to open a new string</span>
    <span class="plain">}</span>
</pre>
|
|
|
|
<p class="endnote">This code is used in <a href="#SP26">§26</a>.</p>
|
|
|
|
<p class="inwebparagraph"><a id="SP27"></a><b>§27. </b>Finally, note that the breaking-up process may result in empty strings
|
|
where square brackets abut each other or the ends of the original string.
|
|
Thus
|
|
</p>
|
|
|
|
<blockquote>
|
|
<p>"[The noun] is on the [colour][style] table."</p>
|
|
|
|
</blockquote>
|
|
|
|
<p class="inwebparagraph">is split as: <code class="display"><span class="extract">"" , The noun , " is on the " , colour , "" , style , " table."</span></code>.
This is not a bug: empty strings are legal. It's for higher-level code to
remove them if they aren't wanted.
|
|
</p>
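<p class="inwebparagraph">The net effect of the division can be modelled on a whole string at once. The sketch below is only an illustration under assumptions and is not how the lexer works internally, since the lexer feeds characters one at a time: the names <code class="display"><span class="extract">append</span></code> and <code class="display"><span class="extract">divide_at_substitutions</span></code> are hypothetical, and the input is assumed to be a single double-quoted literal. Each open square is replaced by a close quote and a comma word, and each close square by a comma word and an open quote.</p>

```c
#include <assert.h>
#include <string.h>

/* Append src to the string in dst if it fits within capacity bytes
   (terminator included); return 1 on success and 0 if it does not fit. */
static int append(char *dst, size_t capacity, const char *src) {
    size_t used = strlen(dst), extra = strlen(src);
    if (used + extra + 1 > capacity) return 0;
    memcpy(dst + used, src, extra + 1);
    return 1;
}

/* Write into out the divided form of one quoted literal; return 1 on
   success and 0 if out is too small to hold the result. */
int divide_at_substitutions(const char *literal, char *out, size_t capacity) {
    assert(capacity > 0);
    out[0] = 0;
    for (size_t i = 0; literal[i] != 0; i++) {
        char single[2] = { literal[i], 0 };
        const char *piece = single;                /* ordinary characters pass through */
        if (literal[i] == '[') piece = "\" , ";    /* close the string, then a comma word */
        if (literal[i] == ']') piece = " , \"";    /* a comma word, then reopen the string */
        if (append(out, capacity, piece) == 0) return 0;
    }
    return 1;
}
```

<p class="inwebparagraph">Applied to the two examples above, this reproduces the divided forms shown, including the empty strings where square brackets abut each other or the ends of the literal.</p>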
|
|
|
|
<p class="inwebparagraph"><a id="SP28"></a><b>§28. Splicing. </b>Once in a while, we need a run of words in the lexer which
all occur in the source text, but not contiguously, so that they
cannot be represented by a pair <code class="display"><span class="extract">(w1, w2)</span></code>. In that event we use the
following routine to splice duplicate references onto the end of the word
list. This does not duplicate the text itself, only references to it.
For instance, if we start with 10 words (0 to 9) and then splice <code class="display"><span class="extract">(2,3)</span></code>
and then <code class="display"><span class="extract">(6,8)</span></code>, we end up with 15 words, and the text of <code class="display"><span class="extract">(10,14)</span></code>
contains the same material as words 2, 3, 6, 7, 8.
|
|
</p>
|
|
|
|
|
|
<pre class="display">
<span class="reserved">wording</span><span class="plain"> </span><span class="functiontext">Lexer::splice_words</span><span class="plain">(</span><span class="reserved">wording</span><span class="plain"> </span><span class="identifier">W</span><span class="plain">) {</span>
    <span class="reserved">int</span><span class="plain"> </span><span class="identifier">L</span><span class="plain"> = </span><span class="functiontext">Wordings::length</span><span class="plain">(</span><span class="identifier">W</span><span class="plain">);</span>
    <span class="functiontext">Lexer::ensure_space_up_to</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain"> + </span><span class="identifier">L</span><span class="plain">);</span>
    <span class="reserved">for</span><span class="plain"> (</span><span class="reserved">int</span><span class="plain"> </span><span class="identifier">i</span><span class="plain">=0; </span><span class="identifier">i</span><span class="plain">&lt;</span><span class="identifier">L</span><span class="plain">; </span><span class="identifier">i</span><span class="plain">++)</span>
        <span class="functiontext">Lexer::word_copy</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain">+</span><span class="identifier">i</span><span class="plain">, </span><span class="functiontext">Wordings::first_wn</span><span class="plain">(</span><span class="identifier">W</span><span class="plain">)+</span><span class="identifier">i</span><span class="plain">);</span>
    <span class="reserved">wording</span><span class="plain"> </span><span class="identifier">N</span><span class="plain"> = </span><span class="functiontext">Wordings::new</span><span class="plain">(</span><span class="identifier">lexer_wordcount</span><span class="plain">, </span><span class="identifier">lexer_wordcount</span><span class="plain"> + </span><span class="identifier">L</span><span class="plain"> - 1);</span>
    <span class="identifier">lexer_wordcount</span><span class="plain"> += </span><span class="identifier">L</span><span class="plain">;</span>
    <span class="reserved">return</span><span class="plain"> </span><span class="identifier">N</span><span class="plain">;</span>
<span class="plain">}</span>
</pre>
|
|
|
|
<p class="endnote">The function Lexer::splice_words is used in 3/fds (<a href="3-fds.html#SP6">§6</a>).</p>
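<p class="inwebparagraph">The arithmetic of the splicing example can be checked with a small stand-alone model. This is a sketch under assumptions, not the real data structures: <code class="display"><span class="extract">word_store</span></code>, <code class="display"><span class="extract">word_range</span></code>, <code class="display"><span class="extract">add_word</span></code> and <code class="display"><span class="extract">splice</span></code> are hypothetical stand-ins for the lexer's word array and for <code class="display"><span class="extract">wording</span></code>, and the store has a fixed size instead of growing on demand. As in the routine above, only pointers to the text are copied, never the text itself.</p>

```c
#include <assert.h>

#define MAX_WORDS 32   /* illustrative fixed limit */

typedef struct {
    const char *text[MAX_WORDS];   /* references to word texts, never copies */
    int count;
} word_store;

typedef struct {
    int first, last;               /* an inclusive range of word numbers */
} word_range;

/* Record a new word at the end of the store and return its word number. */
int add_word(word_store *ws, const char *text) {
    assert(ws->count < MAX_WORDS);
    ws->text[ws->count] = text;
    return ws->count++;
}

/* Duplicate the references in range W at the end of the store, and return
   the range of word numbers now holding the duplicates. */
word_range splice(word_store *ws, word_range W) {
    int L = W.last - W.first + 1;
    assert((W.first >= 0) && (L >= 0) && (W.last < ws->count));
    assert(ws->count + L <= MAX_WORDS);
    for (int i = 0; i < L; i++)
        ws->text[ws->count + i] = ws->text[W.first + i];
    word_range N = { ws->count, ws->count + L - 1 };
    ws->count += L;
    return N;
}
```

<p class="inwebparagraph">Starting from 10 words and splicing <code class="display"><span class="extract">(2,3)</span></code> and then <code class="display"><span class="extract">(6,8)</span></code> in this model returns the ranges <code class="display"><span class="extract">(10,11)</span></code> and <code class="display"><span class="extract">(12,14)</span></code> and leaves a count of 15, in agreement with the example above.</p>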
|
|
|
|
<hr class="tocbar">
|
|
<ul class="toc"><li><i>(This section begins Chapter 3: Words in Sequence.)</i></li><li><a href="3-wrd.html">Continue with 'Wordings'</a></li></ul><hr class="tocbar">
|
|
<!--End of weave-->
|
|
</main>
|
|
</body>
|
|
</html>