Parallel operations performed over all the lines of a given table are very useful, however more often than not, multiple tables are involved. Envision provides several mechanisms in order to perform operations that involve multiple tables, the most simple ones being dimensions and broadcasting, which offer the possibility to rely on natural joins.
Table of contents
Advanced remark: the whole point of natural joins in Envision is to omit entirely the JOIN expression as found in SQL. Indeed, when writing numerical recipes, there are usually only a few joins involved, but those joins end-up being repeated ad nauseam at every single step of the calculation. Envision provides mechanisms to specify the relationships between tables and then lets all the calculations flow accordingly.
A dimension is a special vector that characterizes a table. A table can have several dimensions. Every primary dimension is the primary dimension of exactly one table, but some tables do not have one primary dimension. The scalar table has zero primary dimension. The cross tables, discussed later, have two primary dimensions or more.
The following script creates a
Products table that has
pid as its primary dimension:
table Orders = with [| as Pid, as Date, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] table Products[pid] = by Orders.Pid show table "Products" with pid
In the script above, the table
Products is constructed as a grouping table. The dimension
pid is the primary dimension of the table
Products but also, by construction, a (non-primary) dimension of the table
Orders. We will be revisiting in greater detail this table construction mechanism in the following.
Leveraging the short-name of the dimension
pid that omits the table name, the displayed table is:
However, the long name of the dimension
Products.pid could have been used instead. The logic remains identical, but the last line of the script could have been replaced by:
show table "Products" with Products.pid
For leveraging the dimensions that are found in the tables, Envision defines broadcasting behaviors, which logically consists of copying the values of one table to another. During a broadcast, one line from the source table can be copied to zero or more lines to the destination tables. We say that the values of a given table can be broadcast to another table. Let’s illustrate these broadcasts with:
table Orders = with [| as Pid, as Date, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] table Products[pid] = by Orders.Pid Products.Producer = "Contoso" // broadcast Scalar -> Products Orders.Producer = Products.Producer // broadcast Products -> Orders show table "Orders" with Orders.Producer
First, the scalar literal
"Contoso" is broadcast to the
Products table. This broadcast is legal because the scalar table can - by construction - be broadcast to any other table with any ambiguity. Indeed, the scalar table has a single line, thus no matter how many lines are found in the receiving table, it’s the same value that gets copied.
Products.Producer variable is broadcast to the
Orders table. This broadcast is legal because
Products is a grouping table that has been produced from
Orders in the first place. Thus, every line in
Orders is associated to one and exactly one line in
Products without any ambiguity.
The most common type of broadcasts in Envision are the implicit broadcasts where the expression found on the right side of the assignment operator is mapped to the table specified on the table specified on the left side.
However, not all broadcasts are legal. For example, the following script does not compile :
table Orders = with [| as Pid, as Date, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] table Products[pid] = by Orders.Pid Orders.Producer = "Contoso" Products.Producer = Orders.Producer // WRONG: illegal broadcast show table "Products" with Products.Producer
"Contoso" can be broadcast to
Orders (or any table actually), the variable
Orders.Producer cannot be broadcast to
Orders line is mapped to exactly one
Products line, but the inverse is not true.
Envision offers many ways to create or read tables while specifying the dimensions. The dimensions are particularly interesting because they allow Envision to establish whether a table can broadcast to another table.
Advanced remark: While there is nothing fundamentally novel about numeric broadcasts - Fortran had been supporting it for decades - with the advent of deep learning, there has been a renewed interest in this style of numeric operations, not only because it’s concise and elegant but also because it lends itself to a high degree of data parallelism, which nicely fits the capabilities of modern computing hardware, at least under favorable conditions. Indeed, arbitrary broadcasts - implemented as
Gather - have a notoriously low performance on GPUs. Most deep learning toolkits feature broadcast behaviors geared around tensors, which are more suitable for high-performance computing. Envision makes these behaviors first class citizens of the language instead of delegating them to a specialized library.
Roadmap : The table comprehension syntax is planned to be extended to support specifying the primary dimension within the comprehension itself, instead of having it segregated into a second grouping statement as done above.
While implicit broadcasts are more concise, and usually more readable as well, Envision also supports a mechanism for performing explicit broadcasts at the expression level with the keyword
into. This capability usually helps to reduce the verbosity of the Envision code, allowing to express in a single assignment what would have otherwise taken multiple lines of script.
The usage of the keyword
into is illustrated in the following script:
table Orders = with [| as Pid, as OrderDate, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] where true into Orders // explicit broadcast show table "Nothing filtered" with Orders.Pid Orders.OrderDate Orders.Quantity
where true is not legal (the
Scalar table cannot be filtered here), and furthermore, it does not point out the intended table
into keyword can be suffixed after an expression to indicate the intended table.
The script above can be rewritten in a more verbose form without using the
into keyword with:
table Orders = with [| as Pid, as OrderDate, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] Orders.IsTrue = true // implicit broadcast where Orders.IsTrue show table "Nothing filtered" with Orders.Pid Orders.OrderDate Orders.Quantity
Also, it would have been possible to rewrite the script further by leveraging the table prefix syntax on literals, as illustrated by:
table Orders = with [| as Pid, as OrderDate, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] where Orders.true // explicit broadcast show table "Nothing filtered" with Orders.Pid Orders.OrderDate Orders.Quantity
Then, a slightly more realistic example could be considered with the following script:
table Orders = with [| as Pid, as OrderDate, as Quantity |] [| "apple", date(2020, 4, 15), 3 |] [| "apple", date(2020, 4, 16), 7 |] [| "orange", date(2020, 4, 16), 2 |] table Products[pid] = by Orders.Pid Products.Sold = sum(Orders.Quantity) where (Products.Sold > 4) into Orders show table "Only 1 order filtered" with Orders.Pid Orders.OrderDate Orders.Quantity show table "Nothing filtered" with Products.Pid Products.Sold
Which displays the following tables:
|apple||Apr 15, 2020||7|
|apple||Apr 16, 2020||3|
While it’s the
Products table that appears in the Boolean expression of the
where block, this expression is explicitly broadcast to the
Orders table. As a result, the
Orders table gets filtered, but the
Products table does not.
Most operations in Envision are vectorized, applied element-wise over the single line of the vector(s) involved. The process is relatively simple when all the vectors belong to the same table, however, Envision also supports many situations where vectors belong to distinct tables by automatically leveraging the broadcast behaviors detailed previously. In particular, the process usually boils down to identifying a common table. The script below illustrates this mechanism:
table Variants = with [| as Product, as Size |] [| "shirt", "small" |] [| "shirt", "medium" |] [| "pants", "small" |] [| "socks", "medium" |] table Products[Product] = by Variants.Product show table "Variants" with Products.Product Variants.Size
Which displays the following table:
show table mixes the two tables
Variants, and yet Envision identifies
Variants as the proper common table, and automatically performs a broadcast of the
Products vector toward the
More generally, a common table is acceptable if any table involved in the expression or block can be broadcast to this table. This condition must be fulfilled by the common table, but this condition alone is not sufficient to identify the common table, as multiple tables may be eligible. Thus, Envision probes candidate tables in the following order:
- Each table of the input vectors (or of the expression), starting with the first, and processing them in the script order.
- For each vector originating from a group table, the table from which the group table was created.
- The cross table constructed from all the dimensions found among the tables of the input vectors.
While it rarely makes sense from a supply chain perspective, it is possible to setup a situation where the common table is ambiguous, as illustrated by the following script:
table Variants[Vid] = with [| as Product, as Size |] [| "shirt", "small" |] [| "shirt", "medium" |] [| "pants", "small" |] [| "socks", "medium" |] table Alts[Alt] = by [Vid, Variants.Size] Alts.Foo = Variants.1 // broadcast 'Variants -> Alts' Variants.Foo = Alts.1 // broadcast 'Alts -> Variants' x = sum(Alts.Foo + Variants.Foo) // 'Alts' is the common table y = sum(Variants.Foo + Alts.Foo) // 'Variants' is the common table show summary "Resolved ambiguity" with x, y
The script above creates two tables named
Alts. The table
Alts is a group table, which will be covered in greater detail in the section Non scalar aggregation. The design allows the table
Alts to be broadcast to
Alts is a group table originating from
Variants. The design also allows the table
Variants to be broadcast into the table
Vid among its dimensions, which is the primary dimension of
Then, in the script above the operation
Alts.Foo + Variants.Foo is ambiguous because both
Variants could serve a common table. Yet, due to the rule listed above,
Alts is probed first, hence, it is selected as the common table.