NLG

This page documents the NLG system from early 2023, which was used for web form generation. The main goal was to convert the conditions in the rules into questions.

The code is in natural4/src/LS/NLP/NLG.hs, and the grammars it uses are in natural4/grammars.

Input

The input is a spreadsheet such as the following, which has plenty of natural language in the cells.

§ Assessment
EVERY Organisation
WHICH NOT is a Public Agency
UPON becoming aware a data breach may have occurred
IF the data breach occurred ON 1 Feb 2022
OR " AFTER "
MUST assess if it is a Notifiable Data Breach

It is parsed into the Rule datatype, where the different text fragments go into different fields. For example, the line NOT, is, a Public Agency is a qualifier of the subject (Organisation), and because it occupies that particular field, the system parses it as a relative clause.

The other fields have corresponding grammatical categories as well. The action assess, if it… is parsed as a verb phrase. The condition the data breach occurred is parsed as a full sentence, and if the sentence has a temporal condition (keywords ON, AFTER, etc.), that condition is parsed into an adverbial.
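To make the field-to-category correspondence concrete, here is a hypothetical sketch of such a datatype. The field and type names below are invented for illustration; the real Rule datatype in the natural4 codebase is considerably richer.

```haskell
-- Hypothetical sketch: one field per kind of text fragment in the
-- spreadsheet. Field names are invented; the real Rule type differs.
data Rule = Rule
  { subject    :: String                -- "Organisation"
  , qualifier  :: Maybe String          -- "NOT is a Public Agency" -> relative clause
  , trigger    :: Maybe String          -- UPON fragment -> gerund verb phrase
  , conditions :: [Condition]           -- IF ... OR ... fragments
  , action     :: String                -- MUST fragment -> verb phrase
  } deriving Show

data Condition = Condition
  { clause   :: String                  -- "the data breach occurred", a full sentence
  , temporal :: Maybe (String, String)  -- keyword and time, e.g. ("ON", "1 Feb 2022")
  } deriving Show
```

The point of the sketch is only that the position of a fragment in the spreadsheet determines which field it lands in, and that field determines the grammatical category it is parsed as.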

GF grammar

The fragments are parsed with a GF grammar with the following abstract syntax.

flowchart TD
    A[CustomSyntax\n● Fragment of the GF RGL\n● Ad hoc funs on top] --> B[NL4Base\n● Domain-specific\n constructions for L4]
    B --> C[StandardLexicon\n● Curated lexicon of\n frequent vocabulary]
    B --> D[DomainLexicon\n● Generated from\n the current document\n● Not automatic,\njust a proof of concept!]

GF RGL is the Resource Grammar Library, which contains basic syntactic constructions for ~40 different languages. CustomSyntax takes a subset of those, and adds some specific constructions on top of it.

These constructions were added by Inari based on the two example cases: PDPA and Rodents and Vermin. For example, the RGL doesn't support conjunction of prepositions like "on or after", so we added that to the CustomSyntax module.

We make some assumptions about the forms in which the fragments appear: for example, after UPON there should be a verb phrase in the gerund, not a full sentence with a subject. But these assumptions haven't been written down anywhere.
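One way to write those assumptions down would be as an explicit keyword-to-category mapping. The sketch below is hypothetical (the category names and the function are invented), but the individual pairs are the assumptions described on this page:

```haskell
-- Hypothetical sketch of the undocumented assumptions: the grammatical
-- category each L4 keyword's payload is expected to parse into.
data Category = GerundVP | VP | Sentence | RelClause | Adverbial
  deriving (Eq, Show)

expectedCategory :: String -> Maybe Category
expectedCategory kw = lookup kw
  [ ("UPON",  GerundVP)   -- "becoming aware ..."
  , ("MUST",  VP)         -- "assess if ..."
  , ("IF",    Sentence)   -- "the data breach occurred"
  , ("WHICH", RelClause)  -- "NOT is a Public Agency"
  , ("ON",    Adverbial)  -- temporal keywords
  , ("AFTER", Adverbial)
  ]
```

Even a table like this, kept next to the grammar, would make the assumptions checkable instead of implicit.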

Tree transformations to produce new natural language

After the different fields are parsed into GF trees, we can take their constituents and transform them into different trees.

The most successful application has been creating questions for the web form. For instance, UPON, becoming aware that a data breach may have occurred becomes the question "Have you become aware that a data breach may have occurred?". We can assume that the subject is you, i.e. the person filling in the web form. This works nicely because each of the conditions must be fulfilled for the whole rule to hold, and each condition gets its own individual question as a full sentence.
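As a toy illustration of the transformation, here is a string-level sketch. The real system operates on GF trees, not strings, and the lookup table and function names below are invented; the sketch only shows the idea of turning an UPON gerund into a yes/no question addressed to "you".

```haskell
-- Map a gerund to its past participle. A real system would get this
-- morphology from the GF RGL; this table is invented for illustration.
pastParticiple :: String -> String
pastParticiple g = maybe g id (lookup g irregular)
  where
    irregular =
      [ ("becoming",  "become")
      , ("assessing", "assessed")
      , ("notifying", "notified")
      ]

-- Turn an UPON fragment ("becoming aware that ...") into a yes/no
-- question for the person filling in the web form.
uponToQuestion :: String -> String
uponToQuestion fragment = case words fragment of
  (gerund : rest) -> unwords ("Have" : "you" : pastParticiple gerund : rest) ++ "?"
  []              -> ""
```

Doing this on GF trees instead of strings is what makes the approach portable across the ~40 RGL languages.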

There has long been a goal to create a full document from the spreadsheet version. Some prototypes have existed, but their problem has been that the sentences become very long. I (Inari) personally think that the original spreadsheet form is more readable, because it uses indentation and line breaks. So if (/when) we want to create a full plain-English document from the spreadsheet, we'll need to do something smarter than just concatenating the subtrees into larger trees.

Lexicon generation

We (Maryam and Inari) also ran some smaller experiments in generating the lexicon automatically, using an external parser via the Python NLP library spaCy. Probably the most up-to-date version of the code that generates GF lexica is in Inari's sandbox.

It was surprisingly good at getting the valencies of verbs, but a substantial amount of manual checking and correction was still needed. Given that the next use case didn't use the web app, this system was never needed in practice.

We also never automated the pipeline.

Status

The current code and grammar are deprecated, but I (Inari) believe there are ideas worth salvaging and developing further.

Future goals

Curated lexicon

Actually create a large curated lexicon (the module StandardLexicon in the graph). It should have at least a few thousand words, probably split into several modules by subdomain.

CNL

Make the natural language in the cells more controlled. For instance, we could revisit Meng's "prepositional logic", like the example below.

eat noodles
with chopsticks
at noon
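One way such a layout could be parsed: treat the first line as the head verb phrase and each subsequent line as a prepositional-phrase modifier, recognized by its leading preposition. The sketch below is hypothetical; the keyword list, types, and function names are all invented for illustration.

```haskell
-- Hypothetical sketch of parsing Meng's "prepositional logic" layout.
-- The preposition list is an invented placeholder, not a real L4 keyword set.
knownPreps :: [String]
knownPreps = ["with", "at", "on", "in", "before", "after"]

data Action = Action { headVP :: String, modifiers :: [(String, String)] }
  deriving (Eq, Show)

parseAction :: [String] -> Action
parseAction []        = Action "" []
parseAction (vp:pps)  = Action vp (map splitPP pps)
  where
    splitPP line = case words line of
      (p:rest) | p `elem` knownPreps -> (p, unwords rest)
      _                              -> ("", line)
```

Recognizing a fixed set of prepositions this way is exactly the kind of "natural language keyword" treatment the question below asks about.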

Unlike the all-caps keywords of L4, should we have natural-language keywords that are treated in a specific way when parsing?