inform7/services/words-module/Chapter 4/Nonterminals.w

[Nonterminals::] Nonterminals.

The angle-bracketed terms appearing in Preform grammar.

@h How nonterminals are stored.
Each different nonterminal defined in the |Syntax.preform| code read in,
such as <any-integer>, is going to correspond to a global variable in the
program reading it in, such as |any_integer_NTM|. On the face of it, this is
impossible. How can what happens at run-time affect what variables are named
at compile time?

The answer is that the //inweb// literate programming tool looks through the
complete source code, sees the Preform nonterminals described in it, and
inserts declarations of the corresponding variables into the "tangled" form
of the source code sent to a C compiler to make the actual program. (This is
a feature of //inweb// available only for programs written in InC.)

In particular, the tangler of |inweb| replaces the |[[nonterminals]]| below with
invocations of the |REGISTER_NONTERMINAL| and |INTERNAL_NONTERMINAL| macros.
For example, it inserts the C line:
= (text as C)
	INTERNAL_NONTERMINAL(L"<any-integer>", any_integer_NTM, 1, 1);
=
since this is an "internal" nonterminal; and the macro will then expand
to code which sets up |any_integer_NTM| -- see below.

=
void Nonterminals::register(void) {
	/* The following is not valid C, but causes Inweb to insert lines which are */
	[[nonterminals]];
	/* Back to regular C now */
	nonterminal *nt;
	LOOP_OVER(nt, nonterminal)
		if ((nt->marked_internal) && (nt->internal_definition == NULL))
			internal_error("internal nonterminal has no definition function");
}

@ So, then, //inweb// tangles out code which uses the |REGISTER_NONTERMINAL|
macro for any standard nonterminal, and also tangles a compositor function for
it; the name of which is the nonterminal's name with a |C| suffix. For example,
suppose //inweb// sees the following in the web it is tangling:
= (text as Preform)
	<competitor> ::=
		the pacemaker |              ==> { 1, - }
		<ordinal-number> runner |    ==> { pass 1 }
		runner no <cardinal-number>  ==> { pass 1 }
=
It then tangles this macro usage into //Nonterminals::register// above:
= (text as C)
	REGISTER_NONTERMINAL(L"<competitor>", competitor_NTM);
=
And it also tangles matching declarations for:
(a) the global variable |competitor_NTM|, of type |nonterminal *|;
(b) the "compositor function" |competitor_NTMC|, which is a function to
deal with what happens when a successful match is made against the grammar --
this incorporates the material which //inweb// finds to the right of the |==>|
markers in the Preform definition.

But if we left things at that, we would find ourselves at run-time with
a null variable, a function not called from anywhere, and an instance
somewhere in memory of a nonterminal read in from Preform syntax and
called |"<competitor>"|, but which has no apparent connection to either
the function or the variable. We clearly need to join these together.

And so the |REGISTER_NONTERMINAL| macro expands to code which initialises the
variable to the nonterminal having its name, and then connects that to the
compositor function:

@d REGISTER_NONTERMINAL(quotedname, identifier)
	identifier = Nonterminals::find(Vocabulary::entry_for_text(quotedname));
	identifier->compositor_fn = identifier##C;

@ For example, this might expand to:
= (text as C)
	competitor_NTM = Nonterminals::find(Vocabulary::entry_for_text(L"<competitor>"));
	competitor_NTM->compositor_fn = competitor_NTMC;
=
Note that it is absolutely necessary that |Nonterminals::find| does
return a nonterminal. But we can be sure that it does, since the function creates
a nonterminal object of that name even if one does not already exist.

@ The position for internal nonterminals (i.e. those defined by a function
written by the programmer, not by Preform grammar lines) is similar:
(a) again there is a global variable, say |any_integer_NTM|, of type |nonterminal *|;
(b) but now there is no compositor, and instead there is a function |any_integer_NTMR|
which actually performs the parse directly.

The |INTERNAL_NONTERMINAL| macro similarly initialises and connects these
declarations. |min| and |max| are conveniences for speedy parsing, and supply
the minimum and maximum number of words that the nonterminal can match; these
are needed because the Preform optimiser can't see inside |any_integer_NTMR| to
calculate those bounds for itself. |max| can be infinity, in which case we
use the constant |INFINITE_WORD_COUNT| for it.

@d INTERNAL_NONTERMINAL(quotedname, identifier, min, max)
	identifier = Nonterminals::find(Vocabulary::entry_for_text(quotedname));
	identifier->opt.nt_extremes = LengthExtremes::new(min, max);
	identifier->internal_definition = identifier##R;
	identifier->marked_internal = TRUE;

@ So, then, the following rather lengthy class declaration shows what goes
into a nonterminal. Note that nonterminals are uniquely identifiable by their
names: there can be only one called, say, <any-integer>. This is why its
textual name is referred to as an "ID".

=
typedef struct nonterminal {
	struct vocabulary_entry *nonterminal_id; /* e.g. |"<any-integer>"| */

	/* For internal nonterminals */
	int marked_internal; /* has, or will be given, an internal definition... */
	int (*internal_definition)(wording W, int *result, void **result_p); /* ...this one */
	int voracious; /* if true, scans whole rest of word range */

	/* For regular nonterminals */
	struct production_list *first_pl; /* if not internal, this defines it */
	int (*compositor_fn)(int *r, void **rp, int *i_s, void **i_ps, wording *i_W, wording W);
	int multiplicitous; /* if true, matches are alternative syntax tree readings */
	int number_words_by_production; /* this parses names for numbers, like "huit" or "zwei" */
	unsigned int flag_words_in_production; /* all words in the production should get these flags */

	/* Storage for most recent correct match */
	struct wording range_result[MAX_RANGES_PER_PRODUCTION]; /* storage for word ranges matched */

	struct nonterminal_optimisation_data opt; /* see //The Optimiser// */
	struct nonterminal_instrumentation_data ins; /* see //Instrumentation// */

	CLASS_DEFINITION
} nonterminal;

@ A few notes on this are in order:

(a) As noted above, every nonterminal is either "internal" or "regular". If
internal, it is defined by a function; if regular, it is defined by lines
of grammar (called "productions") and a compositor function.

(b) A few internal nonterminals are "voracious". These are given the entire
word range for their productions to eat, and encouraged to eat as much as they
like, returning a word number to show how far they got. While this effect
could be duplicated with non-voracious nonterminals, that would be quite a bit
slower, since it would have to test every possible word range.

(c) A few regular nonterminals are "multiplicitous". These composite their
results in a way special to the Inform compiler's syntax tree, by stacking
them up as alternative possible readings of the same text. Ordinarily, the
result of parsing text against a nonterminal is that the first grammar line
matching that text determines the meaning, but for a multiplicitous nonterminal,
every line matching the text determines one of perhaps many possible meanings.

(d) For numbering and flagging on regular NTs, see //Nonterminals::make_numbering//
below.

(e) The optimisation data helps the parser to reject non-matching text quickly.
For example, if the optimiser can determine that <competitor> only ever matches
texts of between 3 and 7 words in length, it can quickly reject any run of
words outside that range. (However: note that a maximum of 0 means that the
maximum and minimum word counts are disregarded.) The other fields are harder
to explain -- see //The Optimiser//.

@ So, then, as noted above, nonterminals are identified by their name-words.
The following is not especially fast but doesn't need to be: it's used only
when Preform grammar is parsed, not when Inform text is parsed.

=
nonterminal *Nonterminals::detect(vocabulary_entry *name_word) {
	nonterminal *nt;
	LOOP_OVER(nt, nonterminal)
		if (name_word == nt->nonterminal_id)
			return nt;
	return NULL;
}

@ And the following always returns one, creating it if necessary:

=
nonterminal *Nonterminals::find(vocabulary_entry *name_word) {
	nonterminal *nt = Nonterminals::detect(name_word);
	if (nt == NULL) {
		nt = CREATE(nonterminal);
		nt->nonterminal_id = name_word;

		nt->marked_internal = FALSE; /* by default, nonterminals are regular */
		nt->internal_definition = NULL;
		nt->voracious = FALSE;

		nt->first_pl = NULL;
		nt->compositor_fn = NULL;
		nt->multiplicitous = FALSE;
		nt->number_words_by_production = FALSE; /* i.e., don't */
		nt->flag_words_in_production = 0; /* i.e., apply no flags */

		for (int i=0; i<MAX_RANGES_PER_PRODUCTION; i++)
			nt->range_result[i] = EMPTY_WORDING;

		Optimiser::initialise_nonterminal_data(&(nt->opt));
		Instrumentation::initialise_nonterminal_data(&(nt->ins));
	}
	return nt;
}

@h Word ranges in a nonterminal.
We now need to define the macros |GET_RW| and |PUT_RW|, which get and set
the results of a successful match against a nonterminal (see //About Preform//
for more on this).

We do so by giving each nonterminal a small array of |wording|s, which are
lightweight structures incurring little time or space overhead. The fact that
they are attached to the NT itself, rather than, say, being placed on a
parsing stack of some kind, makes them faster to access, but is possible only
because the parser never backtracks. Similarly, results word ranges are
overwritten if a nonterminal calls itself directly or indirectly: that is, the
inner one's results are wiped out by the outer one. But this is no problem,
since we never extract word-ranges from grammar which is recursive.

Word range 0 is reserved in case we ever need it for the entire text matched
by the nonterminal, though at present we don't need that.

@d MAX_RANGES_PER_PRODUCTION 5 /* in fact, one less than this, since range 0 is reserved */
@d GET_RW(nt, N) (nt->range_result[N])
@d PUT_RW(nt, N, W) { nt->range_result[N] = W; }
@d INHERIT_RANGES(from, to) {
	for (int i=1; i<MAX_RANGES_PER_PRODUCTION; i++) /* not copying range 0 */
		to->range_result[i] = from->range_result[i];
}
@d CLEAR_RW(from) {
	for (int i=0; i<MAX_RANGES_PER_PRODUCTION; i++) /* including range 0 */
		from->range_result[i] = EMPTY_WORDING;
}

@h Other results.
The parser records the result of the most recently matched nonterminal in the
following global variables -- which, unlike word ranges, are not attached to
any single NT.

//inweb// translates the notation |<<r>>| and |<<rp>>| to these variable names:

=
int most_recent_result = 0; /* the variable which |inweb| writes |<<r>>| */
void *most_recent_result_p = NULL; /* the variable which |inweb| writes |<<rp>>| */

@h Flagging and numbering.
The following mechanism arranges for words used in the grammar for a NT to
be given properties just because of that -- either flags or numerical values.
For example, if we wanted the numbers from Stoppard's play "Dogg's Hamlet",
we might have:
= (text as Preform)
	<dogg-numbers> ::=
		sun | dock | trog | slack | pan
=
And if <dogg-numbers> were made a "numbering" NT, the effect would be that
these five words would pick up the numerical values 1, 2, 3, 4, 5, because
they occur in production number 1, 2, 3, 4, 5 for the NT.

=
void Nonterminals::make_numbering(nonterminal *nt) {
	nt->number_words_by_production = TRUE;
}

@ Similarly, we could flag this NT with |NUMBER_MC|, and then the five words
sun, dock, trog, slack, pan would all pick up the |NUMBER_MC| flag
automatically.

=
void Nonterminals::flag_words_with(nonterminal *nt, unsigned int flags) {
	nt->flag_words_in_production = flags;
}

@ This is all done by the following function, which is called when a word |ve|
is read as part of a production with match number |pc| for the nonterminal |nt|:

=
void Nonterminals::note_word(vocabulary_entry *ve, nonterminal *nt, int pc) {
	ve->flags |= (nt->flag_words_in_production);
	if (nt->number_words_by_production) ve->literal_number_value = pc;
}