Spreadsheets introduced reactive (aka continuous evaluation[1]),
side-effect free formulas to the massses,
and despite the well-known limitations of spreadsheets [2],
no other end-user programming environment has managed to surpass their popularity.
Despite hierarchical data naturally occurring in more than half of data structures novices organically create [3],
few formula languages support hierarchical data structures.
Moreover, the few that do often prioritize compatibility with spreadsheets
over natural programming and general HCI principles.
Formula² (MavoScript in earlier literature [4])
is a formula language designed from scratch to support hierarchical data structures
in a way that is natural for novices and reduces the amount of abstract thinking about data required.
However, if the research question was simply How can we design a natural formula language?,
the answer is simple — no structured language can beat natural language at that — almost by definition.
But parsing natural language is resource-intensive and error-prone,
so instead we designed Formula2 to answer the question
What would a formula language look like if it were designed from scratch today,
with the explicit goals of prioritizing usability, natural programming principles [5, 6],
and making common use cases easy,
while still maintaining reasonable parsing and evaluation complexity?
Indeed, while our user studies repeatedly demonstrated that Formula² was easier for novices than alternatives,
it can be parsed very efficiently with a simple Pratt parser [7].
While Formula² was originally developed for Mavo,
it has no dependency on any particular host language or system.
It could be used on hierarchical spreadsheet-like applications,
compiled to JavaScript and used by JavaScript frameworks,
or even used in a standalone environment against arbitrary data (see Figure 4.1).
Consistency with JavaScript or other web technologies was not a primary design consideration for its syntax,
but was used as a tie-breaker for otherwise equally valid design decisions.
Since LANPAR introduced reactive formulas in 1969,
and VisiCalc augmented them them in 1979
with the A1 notation and most of the features we know today [2],
a lot of work has been done to improve on their design.
Gneiss [8, 9] was a spreadsheet system supporting hierarchical data.
However, the bulk of its contributions were in the user interface.
Its formula language prioritized compatibility with spreadsheets and thus,
included the minimum amount of changes needed to support hierarchical data.
References to data still used the obscure A1 notation, just extended it for nested data.
SIEUFERD [10] provided a spreadsheet-like hiearchical data interface
to a SQL database, and thus had to design a formula language that could handle hierarchical data.
While it used readable names rather than A1 notation, compatibility with spreadsheets
was a primary design goal.
More importantly, the primary means of querying list-valued data was its interface for visually constructing SQL queries,
and thus the formula language is purposefully restricted to scalars as a return value.
Spreadsheet formulas are not the only syntax in wide use for reactive formulas.
While spreadsheet formulas are the most popular end-user programming syntax for reactive formulas,
this section would not be complete without mentioning the many reactive JavaScript frameworks,
such as VueJS [11], Angular [12], or Svelte [13]
These frameworks typically support embedding a restricted subset of JavaScript in HTML,
and detect dependencies by parsing the JavaScript code.
There are many other dataflow languages, such as LabView [14],
a visual programming language for dataflow programming.
However, these are typically not designed for end-users,
and are often not reactive in the same sense as spreadsheet formulas.
The usability issues with LabView are well-documented [15].
Mashroom [16] was a dataflow language designed for mashups,
and operating on nested tables.
While some of its operators were similar to Formula²,
and it also centered around aggregation,
it required a lot more technical knowledge to use
and targeted different use cases.
We define the host environment as the environment in which Formula² is embedded.
This could be a spreadsheet-like application, a no-code visual app builder, a web framework,
or a standalone application that allows the user to directly import data, specify Formula²
expressions, optionally select the context node (see Section 4.4.1), and see the result (Figure 4.1).
We have already seen one host environment, Mavo HTML.
Since Mavo is the only host environment of Formula² that we have seen so far,
it may be unclear where Formula² ends and the host environment begins.
Formula² is responsible for (a) parsing and compling expressions,
and (b) evaluating them against an arbitrary data tree.
To maximize flexibility and make it easier to adopt by a variety of host environments,
monitoring dependencies and re-evaluating expressions at the appropriate times
is handled by the host environment.
It could be argued that this makes Formula2 simply a functional, side-effect free language,
since it is the host environment that may (or may not) implement it reactively.
For example, the testing host environment shown in Figure 4.1 does not implement any reactivity.
However, we believe this to be an implementation detail.
The language design and semantics assume a reactive implementation.
The fact that it can function (and even be useful) without reactivity is a testament to its flexibility.
An explicit design goal was to minimize the number of syntactic primitives that need to be learned,
and instead allow as many combinations of these primitives as possible.
Literals: Numbers, strings (enclosed in single quotes, double quotes, or none (see below)), booleans (true/false), empty values (none)
Identifiers: Any string of letters, numbers, or underscores. $ is also allowed, but is reserved for predefined Formula2 identifiers.
Function calls are invoked with the usual syntax of functionName(arg1, arg2, ...).
Commas are optional, but encouraged.
Operators, which can be unary, binary, ternary, or n-ary.
Complex data types such as lists and groups are also constructed using functions and operators.
Going beyond syntactic differences, referencing in JavaScript (and most other programming languages) is shallow:
we cannot reference any task properties without getting ahold of each task object, and we could do that without iterating over the list of tasks.
There are two levels of indirection to go from the data root to the data we are interested in.
This approach minimizes conflicts, but at the cost of verbosity and complexity.
In natural language, scoping is a way to optionally narrow, not a prerequisite for meaning.
Any known object can be referenced, and the context of the conversation is implicitly used to resolve it.
This process may even involve iteration or even a recursive walk through the hierarchy of known objects,
yet humans handle it effortlessly.
We can say put all dirty dishes in the dishwasher without needing to specify
whether the dishes are on the table, in the sink, or in the bedroom.
We can optionally scope it further, e.g. put all dirty dishes from the dinner table in the dishwasher,
but the scope is optional, and the dirty dishes concept has meaning without it, just broader.
In contrast, while programming languages also have a notion of context,
references are typically more restricted.
Context typically simplifies references to identifiers directly within the current scope,
or in ancestor scopes, but that’s about it.
Getting descendant or sibling data typically requires a series of recursive mapping operations.
To word the dirty dishes example using the concepts of most programming languages,
we would have needed to say something like:
“Visit every room in the house and do the following.
If there are any dirty dishes on the floor, put them in the dishwasher.
Look for objects with horizontal surfaces, then look for dirty dishes on them.
If you find any, put them in the dishwasher.
Now look inside all containers larger than a plate.
If you find any dirty dishes, put them in the dishwasher.
If you find other large enough containers, repeat the process for them too.”
Imagine how tedious communication would be if we had to speak like this!
And yet, we have accepted that this is a reasonable way to communicate with computers.
This could explain why scoping and referencing errors are so common among beginner programmers [17–19].
The need to understand scoping rules, write out lengthy namespaces, or — worse — mapping operations when multiple values are desired
are all barriers to entry.
To alleviate this, Formula² uses a novel scoping algorithm that will prioritize explicit references over implicit ones,
but will attempt to resolve any known identifier to the most reasonable author intent,
taking into account both the structure of the formula, the data schema, as well as the placement of the formula relative to the data.
In Formula², any known data identifier can be used anywhere.
However, the value(s) it resolves to depends on the placement of the formula.
Every evaluation of a Formula² expression can be associated with a context node in the data.
This is in contrast with scoping in most programming languages,
which is either lexical (derived from the code hierarchy) or dynamic (derived from the call tree).
Typically the context node would have a spatial association with the formula in the host environment,
for example in Mavo it would be the closest containing property.
In a spreadsheet it could be the cell the formula is in.
A given identifier can resolve to a single value or a list of values,
depending on the relative locations of the context node, the data it references,
and the data schema.
Conceptually, to resolve the value of an identifier, we perform a breadth-first search1This is a simplification and would result in very poor performance.
In practice, the Breadth-First Search is performed on a precomputed schema of the data,
and then the paths are resolved on the actual data. on the data
to find the shortest path(s) from the context node to properties with the given identifier.
We then resolve these paths to values.
If there are more than one paths, or the paths cross any arrays (even if they only contain one element),
the result is a list of values.
If the property is not found at all, then ancestor nodes are searched for the property.
While this algorithm may sound complex,
in practice it affords a more natural way to reference data, where references largely just work
and scoping rarely needs to be explicitly considered.
Since lists are such common return types,
built-in support for list operations is essential.
But Formula² goes a step beyond that, and attempts to blur the line between scalar and list operations.
Its answer to “Am I dealing with a list or a scalar?” is if you don’t already know, it shouldn’t matter,
in the sense that the author providing a list or a scalar communicates intent,
and that intent is generally honored,
but whether a function’s return value is a list or a scalar should largely be irrelevant,
except when it’s a predictable consequence of said author intent.
Part of this is a natural consequence of the scoping algorithm.
Since foo.bar looks at descendant properties via a breadth-first search,
the result will be the same whether foo is a single object with the foo property
or a list containing said object as its only element
(since Formula² lists don’t include arbitrary data properties).
Nearly all Formula² operations (functions and operators) are list-aware.
Operations on lists, between lists, or between scalars and lists require no special syntax or functions.
They largely just work.
The algorithm used is simple.
Operations (including functions) are defined in terms of their scalar equivalents.
When both operands are scalars, the scalar operation is applied.
When only one operand is scalar, the result is a list where each list item is the result of applying the operation to the scalar and the list item.
When both operands are lists, the operation is applied element-wise.
If the lists have different lengths, null is used for missing values.
For example, if rating is a list of ratings across items,
rating > 3 returns a list of boolean values, with true for ratings over 3 and false for those equal to or lower than 3.
This result can then be fed to a count() function, so that a nicely readable count(rating > 3)
returns the number of items with a rating over 3.
Lists returned from operations are flattened,
so that all data is either scalars or lists of scalars,
i.e. authors do not need to deal with lists of lists.
This was chosen to limit complexity,
as lists of lists are less common in naturally occurring data structures [3],
and create amgibuities for many operations
(e.g. what should count(list) return if list is a list of lists?
The number of lists or the total number of elements?).
Most formulas depend on the values of other data, and should update when those values change.
However, some use cases require access to context data that may change even if no data has changed,
and thus should influence when a formula is recalculated, essentially adding itself to its dependencies.
For example, a formula that displays the current time in hours and minutes should update at least every minute,
even if no data has changed.
In spreadsheets, this is expressed as a function, NOW().
However, function calls do not update dependencies, so NOW() has the confusing behavior that it does not actually reflect current time now,
but the time the row was last edited (which is a useful piece of information in its own right,
but needs a name that better describes its purpose).
In Formula², such computations are expressed as special, built-in properties
beginning with $.
For example, the current time is expressed as $now.
Other examples of special properties are:
$mouse, which provides the current mouse position and updates whenever the mouse moves.
$hash, which returns the current URL hash (without the # sign) and updates when the hash changes.
$today, which returns the current date and updates at midnight.
There are also several tree-structural special properties,
which are used to navigate the data tree.
For example, $parent to reference the parent of the current context node,
$previous and $next to reference its siblings,
$all to go from an item to its containing list,
or $index to get the index of the closest list item
(or any list item, when used as a property)
While the $ prefix is designed to prevent clashes with author identifiers
by partitioning the namespace,
we noticed in user studies (see Chapter 7) that authors often forget to use it,
presumably because this is an unnatural use for the $ symbol,
which in natural language is used for currency.
It remains an open problem how to partition the namespace in a way that is more intuitive to authors,
without making it overly verbose (as in the case of a reserved prefix, e.g. f2_now).
In line with its design principle of fault tolerance (Section 1.4.4),
Formula² will attempt to resolve special property identifiers without the $ prefix,
but with a lower priority than those with the $ prefix,
only if no data properties exist with these names.
It will also correctly resolve author identifiers that begin with $.
Strings, enclosed in single or double quotes.
Unquoted strings are also supported if they only consist of identifier-compatible characters,
but have lower precedence than data references.
Since the schema can be mutated at runtime,
this means their meaning could change if a new property is added anywhere in the data
that matches the unquoted string.
Booleans, true and false.
Empty values (aka null)
Lists (aka arrays or collections).
Lists can contain any data type.
There is no dedicated syntactic construct for lists, instead they are defined via a list() function.
Groups (aka objects or dictionaries), which are sequences of key-value pairs.
Groups are defined using a group() function and a colon to separate keys from values.
The colon (:) is actually an operator that returns an object with a single key-value pair.
The group function merely merges these objects.
This has the very nice consequence that when there is only a single key-value pair,
the group() function can be omitted.
Operators are essentially a nicer syntax to improve readability for certain functions,
and to reduce the need for balancing nested parentheses.
All operators are also implemented as functions (but the contrary is not true).
To improve learnability, only widely understood operators are symbols:
All other operators are words,
including
logical operators (and, or, not),
a modulo operator (mod),
filtering (where),
grouping (by / as).
These are designed to eliminate many common types of errors across novice programming and formula languages [20, 21]:
To avoid confusion between = and ==, = and == are both used for comparison.
It is common for + to do double duty: addition or concatenation, heuristically determined by the operands.
This usually leads to errors, when the heuristics predict author intent incorrectly.
Moreover, such a heuristic does not improve ergonomics since addition and concatenation are fundamentally distinct operations.
In Formula², + is only used for addition, and & is used instead for concatenation.
This allows + to work even when numbers are stored as strings, sparing novices from having to think about data types.
Unlike most programming or formula languages, comparison operators are n-ary.
For example, 3 < foo < 5 is perfectly legal and equivalent to (3 < foo) and (foo < 5).
In certain cases, Formula²’s innovation of implicit aggregation becomes its Achilles’ heel.
A big class of these are operations whose operands are not independent, but one operand is implicitly scoped by the other.
The most trivial such case is the narrowing (or dot) operator,
which is used to access properties of objects and to narrow down overly broad references.
With most operators, operands are resolved independently.
For example, in the schema shown in Figure 4.1, a formula like title & " " & due would evaluate title to all task titles, due to all task due dates, and would then concatenate them element-wise separated by a space.
However, if the same approach were followed with the dot operator,
it would produce nonsensical results.
Imagine if given a reference like students.name we evaluated students and name separately!
Instead, students essentially adds a constraint to the resolution of name:
it says “only consider name properties that are descendants of students”.
Whether this constraint is implemented by fetching all students objects and then looking within them, or by resolving name independently and then filtering it, is left up to implementations.
A less obvious example is the filtering operator (where).
To enable filtering of list-valued properties, Formula² supports a where operator and a corresponding filter() function.
The filter() function takes two lists
and returns a new list that contains only the elements of the first list for which the corresponding element in the second list is not empty, false, or 0.
Any surplus elements in the second list are ignored per existing list operation semantics (Section 4.4.2).
However, a where b is not simply syntactic sugar for filter(a, b).
There is one common ambiguity that where resolves:
should properties referenced in predicates be resolved by the context node of the expression as a whole, or in a different way?
This is best explained with an example.
Consider the schema in Section 4.7.2.
Assuming a decision item context node, what would you expect the expression pro where weight > 2 to return?
Presumably, you would strongly suspect it should return pro objects that have a weight property over 2.
However, if where were naïvely rewritten, it would be equivalent to filter(pro, weight > 2).
Like every function, each argument is resolved separately:
pro resolves to all pros of the context decision,
and weight > 2 resolves to a list of booleans …for all weights — including weights for cons!
To get the expected result with the filter() function,
we need to use the narrowing (dot) operator and write filter(pro, pro.weight > 2).
To prevent this terribly confusing behavior,
where is implemented to apply special scoping rules,
to preferentially resolve properties in the second operand
as descendants of the first operand.
This is not the same as naïvely prepending identifiers with a narrowing operator involving the first operand,
nor as using the first operand as a context node:
When we write foo.bar, if bar is not found within foo and its descendants,
the return value is empty2The implementation of Formula² that the Mavo HTML prototype embeds does not actually follow this; a bug that has caused a lot of confusion..
However, in the case of foo where bar > N, if bar is not found within foo and its descendants, we do want the search to continue as normal within the context node of the formula.
It could be argued that using foo as a context node and continuing the search to its ancestors is also reasonable, but in practice that is rarely desirable and has similar issues as dynamic scoping.
The closest rewriting would be (foo.bar or bar)
In some cases, it makes sense to evaluate an identifier using a different context node than the one of the whole formula.
One area where this is needed is nested aggregates, i.e. aggregates of aggregates.
We use nested aggregates when in a hierarchical schema we want to compute an aggregate within descendant lists
and then aggregate the aggregates at a higher level of the tree.
This may sound obscure in the abstract, but use cases that require nested aggregates are quite common in data-driven applications.
For example, assuming the data schema of a restaurant review website, average number of reviews is a nested aggregate: AVERAGE of COUNT.
To support this, Formula² supports an explicit scoping operator, in,
which evaluates its first operand in the context of the second operand.
The in operator has slightly different semantics around list values than other operators (Section 4.4.2):
if the second operand (the scope) is a list, the first operand is evaluated for each item in the list.
This is because it actually affects what the first operand resolves to,
so taking its shape into account would create a logical cycle.
The scoping operator facilitates nested aggregates using a concise, natural syntax.
Returning to the restaurant review example,
to calculate the average number of reviews per restaurant,
we would write average(count(reviews) in restaurants).
It can also facilitate complex object mapping operations, though the syntax can be a little more awkward.
For example, assume we have a list of right triangles (triangle) as objects with width and height properties.
We can easily calculate a list of hypotenuses with sqrt(pow(width, 2) + pow(height, 2)),
but what if we want to transform it to a list of triangles with width, height, and hypoteneuse properties?
This would work, but a novice would be unlikely to write it:
Due to the potential for confusion, its precedence is very high so that authors are forced to use parentheses to scope expressions with more than one term.
In many common cases, Formula²’s scoping heuristic can save authors time and effort.
However, as with many heuristics, there are cases where the heuristic would incorrectly predict author intent.
Per our design principles (Section 1.4.3), inferred author intent should be overridable.
For such cases, Formula² provides several ways to disambiguate references,
either by using more narrow scoping, or by providing ways to tweak reference resolution:
If ancestor values are undesirable, a narrowing operator (.) can be used to scope references to be within a specific object.
Special properties such as $parent, $previous, $next, $item, $this can be used to navigate the data tree of a given reference.
If the entire list of values is desired from a formula within a list, the $allspecial property can be used.
For example, in a list of restaurants, if the formula rating was on or within each restaurant,
it would resolve to a scalar value: the rating of that restaurant.
However, if rating is used outside the list of restaurants,
it would resolve to a list of ratings of all restaurants.
So what if we want to show where the current restaurant stands compared to all others in the list?
100 * count(rating.all > rating) / count(rating.all)
would give us the percentage of restaurants with a higher rating than the current one,
which would allow us to output things like “Top 5%”.
We identified temporal math as a common pain point when working with spreadsheets
and later validated that our hypothesis was true (see Section 7.6).
Yet, temporal computation is ubiquitous in so many types of web applications,
from use cases like personal tracking and project management to
simple things like showing the date a post was created, or how long has passed since.
Formula² provides a variety of functions to support manipulating and formatting temporal values
across various points of the ease-of-use to power spectrum.
Dates, times, and date/times are represented as ISO 8601 [22] strings.
If a mathematical operator is used with at least one string operand (which cannot be parsed as a number),
Fomula² will check if the string is a temporal value,
and will perform a temporal calculation if so.
This allows authors to specify temporal values by simply specifying them as strings,
with no additional effort to convert them to a specific data type.
Since no valid ISO dates are also valid numbers3The only exception is years, since something like "1997" is actually a valid ISO date.
However, this does not participate in the heuristic discussed here,
since if authors are doing math between years, there is nothing special to do — handling them as regular numbers works fine., this is a safe heuristic.
A variety of built-in functions and special properties are provided to make such operations natural to read and write.
For example date($today + 1 * day()) will return tomorrow’s date
and duration($now - "2019-07-12", 2) will return how long it has been since July 12th, 2019
in a human-readable format with two terms (e.g. “3 years, 2 months”).
Temporal expressions involving special properties such as $now should be updated by the host environment at a reasonable rate,
up to display’s framerate, to ensure that they are always up-to-date.
While Formula² does not specify optimizations for this (e.g there is no need to update readable_date($now) at 60 fps),
host environments are encouraged to do so.
The ISO 8601 format also supports timezones, and these are honored in any calculations.
However, it remains important future work to provide more primitives for their handling.
If no timezone is specified, the user’s local timezone is assumed.
Any functions that produce human readable output (such as many temporal functions) are locale-aware:
the host environment can optionally associate a locale with certain data nodes (any node without a locale inherits the locale of its parent),
and output follows this locale.
For example, in Mavo, the locale is derived from the closest HTML lang attrbute.
Ease and efficiency of parsing was a key design consideration.
The entirety of Formula2’s syntax can be parsed by a simple Pratt parser [7].
In fact, all that is needed to adapt a Pratt parser intended for JavaScript expressions4JSEP: ericsmekens.github.io/jsep
to a Formula² parser is to simply modify its operators.
Precedence and associativity of Formula² operators.
To generate a Pratt parser for Formula², the operator associativities and precedences are shown in Table 4.1.
On a high level5The productions do not enforce operator precedence, which is specified seprately, and do not describe the minutiae of number and string literals,
which are on par with JavaScript or similar languages., an EBNF grammar for Formula² could look like this:
After parsing, a number of AST transforms are applied to the parsed expression.
These include:
Flattening logical operators. For example, 3 < foo < 5 will be parsed as (3 < foo) < 5, which would produce an incorrect result.
Only logical operators need to be flattened.
Other operators are either correctly handled via normal precedence rules, or are associative and thus flattening would make no difference.
Rewriting operators into their equivalent functions (with further rewriting for operators that affect scoping, see Section 4.5.5).
If the left operand of : is an identifier, it is rewritten to a string literal.
While Formula2 is not dependent on any particular language or system,
to avoid excessive abstraction, we will describe its implementation in terms of JavaScript concepts.
For Formula² to be able to evaluate expressions against a data tree,
it first needs to walk the data tree to understand its structure and to store pointers from data objects
to this stored information.
This involves:
Walking the data tree to build a schema (unless this is precomputed by the host environment).
This is not a detailed data schema; for example it does not concern itself with data types at all.
It is simply a nested data structure of all possible property paths to help resolve identifiers.
Linking each node to its parent, so that traversal in any direction is possible from any node.
Linking each object to its corresponding schema node, so that traversal in any direction is possible.
Building a schema is not merely a performance improvement:
it is essential for reasonable identifier resolution.
Imagine if the meaning of identifiers changed simply because a collection happened to have no items at the moment of evaluation!
Of course it is also a performance improvement — performing a breadth first search on the data tree
for every identifier would be prohibitively slow even for modestly sized datasets.
The host environment needs to notify Formula² when the data tree changes in a way that affects the schema so it can be rebuilt,
otherwise expression evaluation could be incorrect.
We now present a few examples that highlight how Formula² provides improved ergonomics over JS,
spreadsheet formula languages, and SQL.
Each example schema has been chosen as a representative example of a common data shape.
First, a flat table.
Then, a schema with divergent one-to-many relationships with the same schema.
Finally, a schema with a deeper hierarchical structure.
For the comparison with JS, we will use the scoping convention of many JavaScript frameworks,
that identifiers directly below the context node can be referenced directly,
as well as ancestor properties, but everything else requires explicit scoping.
To compare with spreadsheets and SQL, there is a dilemma:
how to best convert a hierarchical schema to a tabular one
to maximize the comparison fairness?
There are three main ways:
Denormalization: Flatten the data structure into a single table with repeating values, akin to the result of a SQL join.
Normalization: Split the data structure into multiple tables, with foreign keys linking them.
2NF ([23]) is probably sufficient for this.
Blank cells: Use blank cells where (1) would repeat values.
The latter is what is most readable to humans and what they often naturally gravitate to,
as it does not require duplication (1), yet does not involve the complexity and indirection of relations (2).
Effectively it is trying to visually emulate a hierarchy, and it works — for (sighted) humans.
However, it is actively discouraged [24] as it breaks any type of data processing (e.g. formulas, filtering, sorting, etc.).
Repetition is a common way, albeit tedious to manage manually.
However, repetition becomes very awkward when we have divergent one-to-many relationships [10],
as is the case in one of the schemas below.
This only leaves us one option: normalization.
In many cases, defining additional data in the host UI (e.g. computed properties in Mavo, or additional columns in a spreadsheet)
would make some of these expressions a lot simpler,
but to ensure a fair comparison, we will only use the data as it is defined in the schema.
Additionally, we know from Section 7.6 that it is a barrier for novices when a computation requires auxiliary data to be defined in the UI.
For tasks that involve a context node, it is mentioned.
Otherwise, the context node is assumed to be the root of the data tree.
Here we want a data structure that will give us priorities and the number of tasks with that priority.
Aggregate functions like count() have a special behavior with the results of grouping,
and return a list with the counts of all groups.
We can get the group names either as (task by priority).tasktitle
or simply unique(taskTitle), since it is guaranteed to produce the same results, in the same order.
The colon operator can help us combine the two into a single object:
unique(taskTitle):count(task by priority)
This is not possible to do in spreadsheets as the result of a single formula.
Spreadsheets have a dedicated UI feature for this called pivot tables, although no-one tried to use them in our study (Section 9.7).
The closest we can get with the regular primitives is to
add a column E to the table, place the formula =COUNTIF(C2:C, C2) in E2, and drag it down.
Since UI interactions are not acceptable in this comparison, we will not consider this option.
We could get an arrayformula with a list of strings like “P0: 2”, “P1: 3”, etc.
like this:
This is a simple temporal computation task.
More than showcasing the strengths of Formula²,
this task highlights a weakness of the other three languages,
in a task that is very common in data-driven applications.
(taskTitle):duration($now - due)
The parentheses are needed because otherwise Formula2 would create an object with a single taskTitle property
(rather than evaluate taskTitle) due to the special handling of the first operand in the colon operator (see Section 4.6.1.2).
Spreadsheets have the same issue as with the previous task:
to use a simple formula, we need to perform UI interactions such as dragging down a formula.
Otherwise, our only recourse is unwieldy arrayformulas.
But here there is a much bigger issue at play: there is no high-level way to express an interval in a human-readable way.
Spreadsheet applications often provide UI for formatting a number as a duration, but this is not available in formulas and typically produces cryptic separated values of predefined units, so a duration like “2 months, 8 days” may appear like 936:00:00.
Our Lifesheets user study explores this more (see Section 9.7).
Displaying values in any even moderately human-readable way is painful.
Even if we only care about differences of a day or more, we need to do low-level wrangling like:
=TRIM(SUBSTITUTE(IF(DATEDIF(D1,NOW(),"Y")>0,DATEDIF(D1,NOW(),"Y")&" years, ","")&IF(DATEDIF(D1,NOW(),"YM")>0,DATEDIF(D1,NOW(),"YM")&" months, ","")&IF(DATEDIF(D1,NOW(),"MD")>0,DATEDIF(D1,NOW(),"MD")&" days",""),", ",""))
It gets even worse if we want to show time units as well, as one normally would in a task manager:
The ARRAYFORMULA version of this is left as an exercise for the reader.
JavaScript also does not provide a built-in way to format durations in a human-readable way.
In practice, this is often done with a library like moment.js or date-fns,
but for the sake of comparison, we will use a simple, naïve implementation:
task.map(t=>{let start =newDate(task.due);let end =newDate();let elapsed = now - start;let units =[{unit:'year',ms:365*24*60*60*1000},{unit:'month',ms:30*24*60*60*1000},{unit:'day',ms:24*60*60*1000},{unit:'hour',ms:60*60*1000},{unit:'minute',ms:60*1000}];let rtf =newIntl.RelativeTimeFormat('en',{numeric:'auto'});return units.reduce((result,{ unit, ms })=>{const value = Math.floor(elapsed / ms);if(value !==0){
result.push(rtf.format(-value, unit));// Negative value to represent the past
elapsed -= value * ms;}return result;},[]).join(', ');});
Standard SQL is not well-suited for this task either,
as it is not designed for temporal computations.
It is possible to do it, but it is also not pretty:
This is the (Mavo-inferred) schema from the decision-making application that was used in the first lab study (Section 7.1).
It’s a hierarchical schema with two levels of nesting,
where each decision item contains divergent one-to-many relationships.
{decision:[{decision:"Should I go to the party?",answer:null,// calculatedscore:null,// calculatedpro:[{argument:"Fun with friends!",weight:3}],con:[{argument:"I have tons of work",weight:1},{argument:"I will need to buy a present",weight:1}]},{decision:"Should I move?",answer:null,// calculatedscore:null,// calculatedpro:[{argument:"Reduced rent by $500",weight:2}],con:[{argument:"Expensive moving costs",weight:3}]}]}
decision
id
decision
score
answer
1
Should I go to the party?
2
Should I move?
pro
did
argument
weight
1
Fun with friends!
3
2
Reduced rent by $500
2
con
did
argument
weight
1
I have tons of work
1
1
I will need to buy a present
1
2
Expensive moving costs
3
Figure 4.6
The two schemas: hierarchical and tabular.
Since SQL does not have a context node concept,
we calculate all scores:
SELECT decision.id,sum(pro.weight)-sum(con.weight)FROM decision
LEFTJOIN pro ON decision.id = pro.did
LEFTJOIN con ON decision.id = con.did
GROUPBY decision.id
SELECT
d.id,
d.decision,AVG(argument_count)AS average_arguments
FROM(SELECT
p.did AS id,COUNT(p.argument)+COUNT(c.argument)AS argument_count
FROM
pro p
LEFTJOIN
con c ON p.did = c.did
GROUPBY
p.did)AS argument_counts
JOIN
decision d ON d.id = argument_counts.id
GROUPBY
d.id, d.decision;
This is the schema from the restaurant review log that was used in the first lab study (Section 7.1).
It’s a hierarchical schema with three levels of nesting.
{"restaurant":[{"picture":"https://www.toscanoboston.com/common/images/thumb-cucina.jpg","url":"http://www.toscanoboston.com/beacon-hill","name":"Toscano","visit":[{"date":"2016-03-15","title":"Date night!","dish":[{"name":"Filet mignon with black truffle and foie gras","dishRating":4}]}]}]}
SELECT
r.id AS restaurant_id,
r.name AS restaurant_name,AVG(vr.average_dish_rating)AS restaurant_rating
FROM restaurant r
JOIN visit v ON r.id = v.rid
JOIN(SELECT vid,AVG(dishRating)AS average_dish_rating
FROM dish
GROUPBY vid
) vr ON v.rid = vr.vid
GROUPBY
r.id, r.name;
WITH restaurant_ratings AS(SELECT
r.id AS restaurant_id,AVG(vr.average_dish_rating)AS restaurant_rating
FROM restaurant r
JOIN visit v ON r.id = v.rid
JOIN(SELECT
vid,AVG(dishRating)AS average_dish_rating
FROM dish
GROUPBY vid
) vr ON v.id = vr.vid
GROUPBY r.id
)SELECTCOUNT(*)AS good_restaurant_count
FROM restaurant_ratings
WHERE restaurant_rating >3;
Formula2 was designed to make the kinds of computations that are common in small-scale data-driven applications easier to express by novices.
There are numerous use cases arising in general application development that are currently either awkward or impossible to express in Formula2.
While Formula2 can express simple mapping operations very easily,
such as mapping objects to a descendant property or a combination of descendant properties,
many of the kinds of arbitrary mapping operations that a developer can accomplish in an imperative language via a loop are not possible,
or awkward to express with Formula2 alone.
A large class of such use cases is augmenting objects,
such as the triangle hypotenuse example in Section 4.5.5.3,
which can be expressed but the formula to express them is beyond the capabilities of most novices.
One big category is parsing tasks: producing structured data from a string of text,
for example color components from a serialized color, or SVG path segment information from an SVG path string (see Section 8.2.1).
Perhaps a novice-friendly way to express simple patterns could be developed and exposed in Formula2 via suitable functions,
but currently this remains an open question.
Possibly one of the most questionable design decisions of Formula2 is its flattening of multi-dimensional arrays.
E.g. list(1, list(2, 3), list(4, 5)) is flattened to list(1, 2, 3, 4, 5).
The intent behind this was to simplify the number of distinct cases that need to be handled,
since these structures did not naturally occur in Mavo apps,
and since every multi-dimensional array can be expressed as an array of objects.
However, since the original design, some cases have emerged where this flattening is undesirable.
The main example is grouping by nested lists.
Consider a list of people (person), each of whom has a list of hobbies (just a list of strings) (hobby).
Currently, person by hobby would not produce a reasonable result, since it depends on element-wise matching.
Perhaps a compromise solution could be to use a data structure that behaves like a flattened array, but is not actually flattened.
Or the opposite: a flattened array that preserves metadata about the boundaries of its constituent arrays.
We opted for a functional syntax, as it appears to be easier for novices to understand,
and may be familiar from spreadsheets.
We hypothesized that for example, count(rating) is easier to understand than rating.count().
However, it does come with the drawback of authors having to manage nested parentheses,
a common authoring mistake.
Additionally, if some functions are available as functions and some as methods,
it increases the cognitive load on the author to remember which is which.
In line with our design principle of robustness,
it appears an optional dot notation could greatly increase the efficiency and safety of Formula²,
without compromising its learnability.
Rather than authors having to remember which functions are methods,
every function would be available as a method and vice versa.
arg.foo(arg1, arg2) would be equivalent to foo(arg, arg1, arg2).
This would allow authors to use the syntax that is most convenient for their use case,
and feels most natural to them,
without introducing additional error conditions or cognitive tax.
We have repeatedly seen in our user studies that novices struggle with the concept of identifiers having a restricted syntax.
Furthermore, imposing restrictions on identifier names makes it awkward to work with data created by others,
which may not conform to these restrictions.
To prevent syntactic ambiguity, allowing unrestricted identifiers would require a different syntax
to tell the parser that a series of symbols is actually an identifier,
and delineate where it begins and ends.
It appears that brackets ([]) could be a good fit for this purpose, and have some precedent ([10]).
However, this would require substantial changes in Mavo, which uses brackets to embed Formula² expressions in literal text.
For functions returning a numerical value with no arguments, and constants,
it is often more readable to be able to express multiplications with a fixed number as a number with a unit.
For example, date($today + 2 * days()) will return the date two days from now.
But it might be a lot more natural if we could write date($today + 2days).
Or, rather than calculating a circle’s circumference as radius * 2 * pi,
it may be more readable to write radius * 2pi.
Formula² could implement this as a general language construct: a number followed by an identifier
is syntactic sugar for the multiplication of the number and the value of the identifier if the
identifier is a number,
or the value of calling the identifier as a function with no arguments if the identifier is a function.
This could be a property of specific functions,
since it only makes sense for a relatively small percentage of functions.
Alternatively, the number and type of quantity could be retained in an object that can be coearced to a number when needed.
This could allow expressing a lot more values as first-class citizens, e.g. currencies, temperatures, measurements, etc.
While performance optimizations do not affect the language itself,
in practice they can have a significant impact on the user experience.
Currently, the only performance optimization that the prototype implementation of Formula2 employs is
the use of a schema to precompute all possible property paths.
Even caching the result of an expression so that expressions are not re-evaluated if the data has not changed, is relegated to the host environment.
However, given the lack of abstractions in the formulas novices write (see [25] and Section 9.8),
more elaborate caching mechanisms are essential.
By moving caching within Formula2, individual operands can be independently cached,
so that for example task by priority and count(task by priority) only needs to compute the grouping once.
Additionally, a common pattern we have observed (see Chapter 7) is for certain properties to be repeatedly used
to filter the same structure, often acting essentially as implicit primary keys.
For an illustrative example,
consider a collection of books and their metadata (possibly fetched from a remote API),
and a reading log with a list of books read, ratings, and notes, maintained by the app user.
The app author could display a dropdown of books to associate each entry with,
and store the selected book ID in a bookId property.
Then, to display book metadata next to each entry, the author would have to run a filtering operation such as
books where id = bookId.
The problem with that is that this would iterate over the entire books collection for each entry,
making lookups O(N2).
These types of tasks are the bread and butter of relational database systems: you simply declare bookId as a foreign key,
and assuming book.id is a primary key, the database will take care of the rest.
However, this is nontrivial for novices and end-user programmers, who struggle to think in terms of relations [26]
and are goal-oriented, and therefore averse to preparatory work such as setting up schemas and indices.
However, perhaps end-users don’t need to do this work.
Assuming Formula2 provided a mechanism for declaring primary keys and indices,
creating them could be handled by the host environment,
which has a lot more information about what expressions may be evaluated and how frequently.
To use Mavo as an example, a simple heuristic would be to automatically create indices for any property used in a where or by clause
within a collection.
Another (not mutually exclusive) direction could be for Formula2 to automatically optimize expressions
based on a combination of factors such as how frequently they are evaluated,
and how large the collections they operate on are.
Most common programming languages support short-circuit evaluation for logical and conditional(ternary) operators.
Later operands are only evaluated if earlier operands do not suffice to determine the value of the expression.
For example, in a and b, if a is false, b is not evaluated, since we already know that the overall value must be false.
Due to its handling of list-valued operations, Formula² cannot support short-circuit evaluation,
as all operands need to be examined to determine the shape of the result.
This is not an issue for its use as a reactive formula language,
but becomes one once side effects are introduced (such as by Data Update Actions, see Section 6.3.4).
In this chapter, we presented Formula2, a formula language designed for end-user programming in data-driven web applications,
and optimized for hierarchical schemas.
Our user studies (described in Chapter 7) have shown that Formula2 is easy to learn for novices,
who often did not believe that expressions they wrote could work.
While Mavo is currently the only deployed Formula2 host environment,
we believe its potential is much broader, and look forward to seeing additional implementations in the wild.
This is a simplification and would result in very poor performance.
In practice, the Breadth-First Search is performed on a precomputed schema of the data,
and then the paths are resolved on the actual data. ↩︎
The implementation of Formula² that the Mavo HTML prototype embeds does not actually follow this; a bug that has caused a lot of confusion. ↩︎
The only exception is years, since something like "1997" is actually a valid ISO date.
However, this does not participate in the heuristic discussed here,
since if authors are doing math between years, there is nothing special to do — handling them as regular numbers works fine. ↩︎
The productions do not enforce operator precedence, which is specified seprately, and do not describe the minutiae of number and string literals,
which are on par with JavaScript or similar languages. ↩︎
Burnett, M., Atwood, J., Djang, R.W., Reichwein, J., Gottfried, H. and Yang, S. 2001. Forms/3: A first-order visual language to explore the boundaries of the spreadsheet paradigm. Journal of Functional Programming. 11, (2001), 155–206.
Cited in1
Galassi, G. and Mattessich, R.V. 2014. Some clarification to the evolution of the electronic spreadsheet. Journal of Emerging Technologies in Accounting. 11, (2014), 99–104.
Cited in1, and
2
Benson, E. and Karger, D.R. 2014. End-users publishing structured information on the Web: An observational study of what, why, and how. ACM CHI Conference on Human Factors in Computing Systems (2014), 1265–1274. 10.1145/2556288/2557036.
Cited in1, and
2
Verou, L., Zhang, A.X. and Karger, D.R. 2016. Mavo: Creating interactive data-driven web applications by authoring HTML. UIST 2016 - Proceedings of the 29th Annual Symposium on User Interface Software and Technology (2016), 483–496. 10.1145/2984511.2984551.
Cited in1
Myers, B.A., Ko, A.J., Scaffidi, C., Oney, S., Yoon, Y., Chang, K., Kery, M.B. and Li, T.J.-J. 2017. Making End User Development More Natural. New Perspectives in End-User Development. F. Paternò and V. Wulf, eds. Springer International Publishing. 1–22. 10.1007/978-3-319-60291-2_1.
Cited in1
Pratt, V.R. 1973. Top down operator precedence. Proceedings of the 1st annual ACM SIGACT-SIGPLAN symposium on Principles of programming languages - POPL ’73 (Boston, Massachusetts, 1973), 41–51. 10.1145/512927.512931.
Cited in1, and
2
Chang, K.S.-P. and Myers, B.A. 2016. Using and Exploring Hierarchical Data in Spreadsheets. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (New York, NY, USA, May 2016), 2497–2507. 10.1145/2858036.2858430.
Cited in1
Chang, K.S.-P. and Myers, B.A. 2017. Gneiss: spreadsheet programming using structured web service data. Journal of Visual Languages & Computing. 39, (Apr. 2017), 41–50. 10.1016/j.jvlc.2016.07.004.
Cited in1
Bakke, E. and Karger, D.R. 2016. Expressive Query Construction through Direct Manipulation of Nested Relational Results. Proceedings of the 2016 International Conference on Management of Data (New York, NY, USA, Jun. 2016), 1377–1392. 10.1145/2882903.2915210.
Cited in1,
2, and
3
Green, T.R.G. and Petre, M. 1996. Usability Analysis of Visual Programming Environments: A ‘Cognitive Dimensions’ Framework. Journal of Visual Languages & Computing. 7, (Jun. 1996), 131–174. 10.1006/jvlc.1996.0009.
Cited in1
Wang, G., Yang, S. and Han, Y. 2009. Mashroom: end-user mashup programming using nested tables. Proceedings of the 18th international conference on World wide web (New York, NY, USA, Apr. 2009), 861–870. 10.1145/1526709.1526825.
Cited in1
Pane, J.F., Myers, B.A., and others 2001. Studying the language and structure in non-programmers’ solutions to programming problems. International Journal of Human-Computer Studies. 54, (2001), 237–264. 10.1006/ijhc.2000.0410.
Cited in1
Ma, L., Ferguson, J., Roper, M. and Wood, M. 2011. Investigating and improving the models of programming concepts held by novice programmers. Computer Science Education. 21, (2011), 57–80.
Cited in1
Mase, M.B. and Nel, L. 2022. Common Code Writing Errors Made by Novice Programmers: Implications for the Teaching of Introductory Programming. ICT Education (Cham, 2022), 102–117. 10.1007/978-3-030-95003-3_7.
Cited in1
Panko, R.R. 2013. The cognitive science of spreadsheet errors: Why thinking is bad. Proceedings of the Annual Hawaii International Conference on System Sciences (2013). 10.1109/HICSS.2013.513.
Cited in1
154, I. 2019. ISO standard. ISO 8601-1:2019 - Date and time — Representations for information interchange — Part 1: Basic rules: https://www.iso.org/standard/70907.html. Accessed: 2022-09-13.
Cited in1
Broman, K.W. and Woo, K.H. 2018. Data Organization in Spreadsheets. The American Statistician. 72, (Jan. 2018), 2–10. 10.1080/00031305.2017.1375989.
Cited in1
Scaffidi, C. and Shaw, M. 2010. Chapter 21 - Reuse in the world of end user programmers. No Code Required. A. Cypher, M. Dontcheva, T. Lau, and J. Nichols, eds. Morgan Kaufmann. 407–421. 10.1016/B978-0-12-381541-5.00021-3.
Cited in1
Jagadish, H.V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A. and Yu, C. 2007. Making database systems usable. Proceedings of the 2007 ACM SIGMOD international conference on Management of data (New York, NY, USA, Jun. 2007), 13–24. 10.1145/1247480.1247483.
Cited in1