Natural joins

Parallel operations performed over all the lines of a given table are very useful, however more often than not, multiple tables are involved. Envision provides several mechanisms in order to perform operations that involve multiple tables, the most simple ones being dimensions and broadcasting, which offer the possibility to rely on natural joins.

Table of contents

Advanced remark: the whole point of natural joins in Envision is to omit entirely the JOIN expression as found in SQL. Indeed, when writing numerical recipes, there are usually only a few joins involved, but those joins end-up being repeated ad nauseam at every single step of the calculation. Envision instead provides a tool (“dimensions”) to specify the relationships between tables and then deduces the natural joins from those specifications.

Dimensions

A dimension is a special vector that characterizes a table. A table can have several dimensions. Every primary dimension is the primary dimension of exactly one table, but some tables do not have one primary dimension. The scalar table has no primary dimension. The cross tables, discussed later, have two primary dimensions or more.

The following script creates a Products table that has pid as its primary dimension:

table Orders = with
  [| as Pid, as Date, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

table Products[pid] = by Orders.Pid

show table "Products" with pid

In the script above, the table Products is constructed as a grouping table. The dimension pid is the primary dimension of the table Products but also, by construction, a (non-primary) dimension of the table Orders. We will be revisiting in greater detail this table construction mechanism in the following.

Leveraging the short-name of the dimension pid that omits the table name, the displayed table is:

pid
apple
orange

However, the long name of the dimension Products.pid could have been used instead. The logic remains identical, but the last line of the script could have been replaced by:

show table "Products" with Products.pid

Implicit broadcasts

For leveraging the dimensions that are found in the tables, Envision defines broadcasting behaviors, which logically consists of copying the values of one table to another. During a broadcast, one line from the source table can be copied to zero or more lines to the destination tables. We say that the values of a given table can be broadcast to another table. Let’s illustrate these broadcasts with:

table Orders = with
  [| as Pid, as Date, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

table Products[pid] = by Orders.Pid

Products.Producer = "Contoso"       // broadcast Scalar -> Products
Orders.Producer = Products.Producer // broadcast Products -> Orders

show table "Orders" with Orders.Producer

First, the scalar literal "Contoso" is broadcast to the Products table. This broadcast is legal because the scalar table can - by construction - be broadcast to any other table without any ambiguity. Indeed, the scalar table has a single line, thus no matter how many lines are found in the receiving table, it’s the same value that gets copied.

Second, the Products.Producer variable is broadcast to the Orders table. This broadcast is legal because Products is a grouping table that has been produced from Orders in the first place. Thus, every line in Orders is associated to one and exactly one line in Products without any ambiguity.

The most common type of broadcasts in Envision are the implicit broadcasts where the expression found on the right side of the assignment operator is mapped to the table specified on the table specified on the left side.

However, not all broadcasts are legal. For example, the following script does not compile :

table Orders = with
  [| as Pid, as Date, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

table Products[pid] = by Orders.Pid

Orders.Producer = "Contoso"
Products.Producer = Orders.Producer // WRONG! Cannot broadcast Orders data to Products data.

show table "Products" with Products.Producer

While "Contoso" can be broadcast to Orders (or any table actually), the variable Orders.Producer cannot be broadcast to Products. Each Orders line is mapped to exactly one Products line, but the inverse is not true.

Envision offers many ways to create or read tables while specifying the dimensions. The dimensions are particularly interesting because they allow Envision to establish whether a table can broadcast to another table.

Advanced remark: While there is nothing fundamentally novel about numeric broadcasts - Fortran has been supporting it for decades - with the advent of deep learning, there has been a renewed interest in this style of numeric operations, not only because it’s concise and elegant but also because it lends itself to a high degree of data parallelism, which nicely fits the capabilities of modern computing hardware, at least under favorable conditions. Indeed, arbitrary broadcasts - implemented as Gather - have a notoriously low performance on GPUs. Most deep learning toolkits feature broadcast behaviors geared around tensors, which are more suitable for high-performance computing. Envision makes these behaviors first class citizens of the language instead of delegating them to a specialized library.

Roadmap : The table comprehension syntax is planned to be extended to support specifying the primary dimension within the comprehension itself, instead of having it segregated into a second grouping statement as done above.

Explicit broadcasts

While implicit broadcasts are more concise, and usually more readable as well, Envision also supports a mechanism for performing explicit broadcasts at the expression level with the keyword into. This capability usually helps to reduce the verbosity of the Envision code, allowing to express in a single assignment what would have otherwise taken multiple lines of script.

The usage of the keyword into is illustrated in the following script:

table Orders = with
  [| as Pid, as OrderDate, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

where true into Orders // explicit broadcast
  show table "Nothing filtered" with
    Orders.Pid
    Orders.OrderDate
    Orders.Quantity

A filter where true is not legal (the Scalar table cannot be filtered using where), and furthermore, it does not point out the intended table Orders. The into keyword can be suffixed after an expression to indicate the intended table.

The script above can be rewritten in a more verbose form without using the into keyword with:

table Orders = with
  [| as Pid, as OrderDate, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

Orders.IsTrue = true // implicit broadcast

where Orders.IsTrue
  show table "Nothing filtered" with
    Orders.Pid
    Orders.OrderDate
    Orders.Quantity

Also, it would have been possible to rewrite the script further by leveraging the table prefix syntax on literals, as illustrated by:

table Orders = with
  [| as Pid, as OrderDate, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

where Orders.true // explicit broadcast
  show table "Nothing filtered" with
    Orders.Pid
    Orders.OrderDate
    Orders.Quantity

Then, a slightly more realistic example could be considered with the following script:

table Orders = with
  [| as Pid, as OrderDate, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "apple",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 16), 2 |]

table Products[pid] = by Orders.Pid

Products.Sold = sum(Orders.Quantity)

where (Products.Sold > 4) into Orders
  show table "Only 1 order filtered" with
    pid // auto-broadcast to upstream table 'Orders'
    Orders.OrderDate
    Orders.Quantity
  
  show table "Nothing filtered" with
    pid
    Products.Sold

Which displays the following tables:

Pid OrderDate Quantity
apple Apr 15, 2020 7
apple Apr 16, 2020 3

and

Pid Sold
apple 10
orange 2

While it’s the Products table that appears in the Boolean expression of the where block, this expression is explicitly broadcast to the Orders table. As a result, the Orders table gets filtered, but the Products table does not.

Common tables

With very few exceptions, any operation in Envision that involves multiple vectors requires those vectors to be in the same table. For convenience, if the provided vectors are in different tables, Envision will attempt to identify a common table and use implicit broadcasts to bring all the vectors into that table. The script below illustrates this mechanism:

table Variants = with
  [| as Product, as Size |]
  [| "shirt", "small" |]
  [| "shirt", "medium" |]
  [| "pants", "small" |]
  [| "socks", "medium" |]

table Products[Product] = by Variants.Product

show table "Variants" with
  Products.Product
  Variants.Size

Which displays the following table:

Product Size
shirt small
shirt medium
pants small
socks medium

The show table expects all vectors to be in the same table, but the two provided vectors are in the two tables Products and Variants. Envision correctly identifies Variants as the common table, and automatically performs a broadcast of the Products.Product vector toward the Variants table.

The common table is always one of the source tables, with one exception: if there are exactly two source tables, none of which support broadcasting into the other, and there exists exactly one cross-table with the source tables as its left and right components, then that cross-table is used as the common table.

Advanced remark: while it rarely makes sense from a supply chain perspective, it is possible to setup a situation where both source tables can be used as the common table, making the choice ambiguous. In that situation, the first source table is used as the common table. This is illustrated by the following script:

table Variants[Vid] = with
  [| as Product, as Size |]
  [| "shirt", "small" |]
  [| "shirt", "medium" |]
  [| "pants", "small" |]
  [| "socks", "medium" |]

table Alts[Alt] = by [Vid, Variants.Size]
Alts.TVid, _ = Alt          // decomposing the tuple
expect Alts.Vid = Alts.TVid // add secondary dimension to 'Alts'

Alts.Foo = Variants.1 // broadcast 'Variants -> Alts'
Variants.Foo = Alts.1 // broadcast 'Alts -> Variants'

x = sum(Alts.Foo + Variants.Foo) // 'Alts' is the common table
y = sum(Variants.Foo + Alts.Foo) // 'Variants' is the common table

show summary "Resolved ambiguity" a1b1 with x, y

The script above creates two tables named Variants and Alts. The table Alts is a group table, which will be covered in greater detail in the section Non scalar aggregation. The design allows the table Alts to be broadcast to Variants as Alts is a group table originating from Variants. The design also allows the table Variants to be broadcast into the table Alts because Alts features Vid among its dimensions, which is the primary dimension of Variants.

Then, in the script above the operation Alts.Foo + Variants.Foo is ambiguous because both Alts and Variants could serve a common table. Yet, due to the rule listed above, Alts is probed first, hence, it is selected as the common table.

Lookups

Lookups allow dictionary-style data accesses in Envision. A key is specified to identify a line of interest in a table. The key is intended to match a value found in the primary dimension of the table associated with the vector. The following script illustrates the lookup syntax:

table Orders[Pid] = with
  [| as Pid, as Date, as Quantity |]
  [| "apple",  date(2020, 4, 15), 3 |]
  [| "pear",  date(2020, 4, 16), 7 |]
  [| "orange", date(2020, 4, 17), 2 |]

show scalar "" with Orders.Date["orange"] // 'Apr 17, 2020'

In the above script, Orders.Date["orange"] is the lookup operation. The key belongs to the scalar table, and as a result, the lookup result is a scalar as well.

While lookups are frequently used in many languages, they are quite a rare occurrence in Envision scripts. As a rule of thumb, in Envision, natural joins should be favored whenever possible.

When performing a lookup operation, the resulting vector belongs to the same table as the one holding the key values. The following script illustrates this principle:

table Products[product] = with
  [| as Product,   as Price |]
  [| "apple",      1.50     |]
  [| "pear",       1.30     |]
  [| "orange",     2.10     |]
  [| "clementine", 2.70     |]

table Selection = with
  [| as Choice |]
  [| "pear"    |]
  [| "orange"  |]

show table "Selection" a1b4 with
  Selection.Choice
  Products.Price[Selection.Choice] as "Price"

In the above script, Products.Price[Selection.Choice] evaluates as a vector that belongs to the Selection table.

In summary, there are two rules for lookups. First, the output table is always the same as the table in which the keys are provided. Second, all keys must be provided by that table either explicitly or implicitly. Natural joins represent the implicit flavor of the lookup: the primary dimension found in the table is used as keys. In a later section, we will see how cross tables introduce a variant of this mechanism.

It is also possible to have a lookup performed on a table upon itself as illustrated below:

table Products[product] = with
  [| as Product,   as Price, as Substitute  |]
  [| "apple",      1.50,    "pear"          |]
  [| "pear",       1.30,    "apple"         |]
  [| "orange",     2.10,    "clementine"    |]
  [| "clementine", 2.70,    "orange"        |]

show table "Products" a1d4 with
  product
  Products.Price
  Products.Substitute
  Products.Price[Products.Substitute] as "Substitute Price"

Advanced remark: Lookups in Envision are reminiscent, to some extent, to Microsoft Excel’s VLOOKUP. Syntax-wise, they are somewhat similar to dictionary lookups in Python, JavaScript and many other languages.

Default and lookups

If a lookup is requested specifying a key value that is absent from the primary dimension, then the default value for the data type is used, as illustrated below:

table Products[product] = with
  [| as Product,   as Price |]
  [| "apple",      1.50     |]
  [| "pear",       1.30     |]
  [| "orange",     2.10     |]
  [| "clementine", 2.70     |]

table Selection = with
  [| as Choice |]
  [| "pear"    |]
  [| "orange"  |]
  [| "banana"  |] // missing

show table "Selection" a1b4 with
  Selection.Choice
  Products.Price[Selection.Choice] as "Price" // '0' on 'banana'

However, it is also possible to specify an alternative default value through the keyword default, as illustrated below:

table Products[product] = with
  [| as Product,   as Price |]
  [| "apple",      1.50     |]
  [| "pear",       1.30     |]
  [| "orange",     2.10     |]
  [| "clementine", 2.70     |]

table Selection = with
  [| as Choice |]
  [| "pear"    |]
  [| "orange"  |]
  [| "banana"  |] // missing

show table "Selection" a1b4 with
  Selection.Choice
  Products.Price[Selection.Choice] default -1 as "Price" //  '-1' on 'banana'

Finally, if defaulting isn’t supposed to happen in the first place then default fail can be used to generate a runtime error, as illustrated by:

table Products[product] = with
  [| as Product,   as Price |]
  [| "apple",      1.50     |]
  [| "pear",       1.30     |]
  [| "orange",     2.10     |]
  [| "clementine", 2.70     |]

table Selection = with
  [| as Choice |]
  [| "pear"    |]
  [| "orange"  |]
  [| "banana"  |] // missing

show table "Selection" a1b4 with
  Selection.Choice
  Products.Price[Selection.Choice] default fail

As a rule of thumb, if the value returned by the lookup is consumed by a later calculation - as opposed to be just immediately displayed - and if there is no “obvious” value to default to, it is recommended to use default fail. Indeed, GIGO (garbage in, garbage out) problems are worse than clean failures while both require bugfixes anyway.

Tuples and cross-tables

When using a lookup into a cross-table, there are two primary dimensions for which the lookup can provide keys. It is possible to provide either of them, or both at the same time. Consider the following example:

table Sizes[size] = with 
  [| as size, as Weight |]
  [| "S",     0.5       |]
  [| "M",     0.75      |]
  [| "L",     1         |]
  [| "XL",    0.75      |]

table Colors[color] = with 
  [| as color, as Weight |]
  [| "white",  1         |]
  [| "red",    2         |]
  [| "green",  0.5       |]
  [| "blue",   0.75      |]
  [| "black",  1         |]

table Catalog = with 
  [| as Size, as Color, as AltColor, as Price |]
  [| "S",     "red",    "blue",      9.99     |]
  [| "M",     "blue",   "white",     10.99    |]
  [| "XL",    "black",  "green",     15.49    |]

expect Catalog.size = Catalog.Size
expect Catalog.color = Catalog.Color

table Variant = cross(Sizes, Colors)
Variant.W = Sizes.Weight * Colors.Weight 

// When both dimensions are provided (by name), behaves as a normal lookup.
WhiteLarge = Variant.Weight[size: "L", color: "white"]
Catalog.Alt = Variant.W[size: Catalog.size, color: Catalog.AltColor]

// If a single dimension is provided by name, behavior depends on whether key
// is a scalar or not. 

// If scalar, the result is a vector in the other table of the cross-table: 
Colors.Small = Variant.W[size: "S"]
Sizes.Blue = Variant.W[color: "blue"]

// If non-scalar, the lookup will automatically look for the other 
// dimension in the key's table. 
Catalog.Alt = Variant.W[color: Catalog.AltColor] // implicit 'size: Catalog.size'

// Note that if no label is provided, the rightmost dimension is assumed: 
Catalog.Alt = Variant.W[Catalog.AltColor] // same as the abvoe

In summary, the rules are:

Lookups using lag syntax

When the key of a lookup is a number literal, such as T.X[+2] or T.X[-3], this is called lag syntax and is a shorthand notation for T.X[T.dim + 2] and T.X[T.dim - 3] (assuming dim is the name of the dimension being looked up). The name lag comes from the fact that this syntax is usually employed to perform a time lag operation on a time series. For example, the script below computes the month-to-month variation, for each item.

table Catalog[ref] = with 
  [| as ref |]
  [| "A001" |]
  [| "B002" |]

// Creates 'Month' table with 'month' primary dimension
keep span date = [date(2024, 1, 1) .. date(2024, 12, 31)]

table CatalogMonth = cross(Catalog, Month)
CatalogMonth.Sales = random.uniform(1, 10)

CatalogMonth.SalesLag = CatalogMonth.Sales[-1]

// Equivalent to: 
CatalogMonth.SalesLag = CatalogMonth.Sales[
    ref: CatalogMonth.ref,
    month: CatalogMonth.month - 1]

CatalogMonth.MonthlyIncrease = CatalogMonth.Sales - CatalogMonth.SalesLag

That is, for a given product and month, CatalogMonth.SalesLag contains the value of the sales for the same product, on the previous month.

Tuples and lookups

The keys passed to lookups are index expressions, and so can be provided as tuples.

Tuples can also be used to perform lookups whenever a table’s primary dimension happens to be typed as a tuple. The following script illustrates this behavior with a table of SKUs where the primary dimension is a tuple that includes both a location and a product reference:

table Raw = with
  [| as Loc, as Ref, as OnHand |]
  [| "Paris", "hat", 1 |]
  [| "Paris", "shirt", 3 |]
  [| "New York", "shirt", 2 |]
  [| "New York", "pant", 5 |]

table SKUs = single by (Raw.Loc, Raw.Ref)

table Query = with
  [| as Loc, as Ref |]
  [| "Paris", "shirt" |]
  [| "Paris", "pant" |]
  [| "New York", "shirt"|]

show table "Query result" a1c4 with
  Query.Loc
  Query.Ref
  SKUs.OnHand[Query.Loc, Query.Ref] // 3, 0, 5

In the above script, the table SKUs is defined with a tuple as its primary dimension as the by keyword gets two vectors as arguments. The last line SKUs.OnHand[Query.Loc, Query.Ref] is the tuple lookup itself, which returns 0 (zero) if the entry is not found in the primary dimension of SKUs.

User Contributed Notes
0 notes + add a note