Joining tables

Envision provides an implicit mechanism to join tables whenever those tables contain an Id (item index) field, as it is frequently the case to represent sales history or purchase history. However, this implicit mechanism - called natural joins - is not suitable for more complex situations. Thus, Envision offers a more general join mechanism referred to as the by-at. The by-at is a special kind of aggregation that performs specific joins over tables.

An illustrating script

In this section, we use the sample data set which should be available from the /sample path in your Lokad account. If it’s not already done, we suggest first to read the entry reading files with Envision, with a closer look at the file type expectations, as this section is highly relevant to better understand the following script.

read "/sample/Lokad_Items.tsv" as Items[Id] with
  Id : text
  Name : text
  Supplier : text

read "/sample/Lokad_Suppliers.tsv" as Suppliers with
  Supplier : text
  Moq : number

Items.Moq = same(Suppliers.Moq) by Suppliers.Supplier at Items.Supplier
show table "Item list" a1d4 tomato with

The first three lines are our usual statements for reading files. On line 6, the read statement indicates that the suppliers table isn’t getting any type expectations - it’s an independent table with no primary key and no foreign key. Indeed the file Lokad_Suppliers.tsv - unlike all the other .tsv files of the folder - does not contain an Id column. As this column is missing, Envision cannot implicitly join this table with the Items table. As a matter of fact, the supplier’s table is aggregated per supplier with one supplier per line, while all the other files are detailed per item.

On line 10, we leverage the by-at aggregation, called this way because it uses the by and at keywords, to perform a join between the Items table and the Suppliers table. We could have similarly written:

Items.Moq = same(Suppliers.Moq) by Suppliers.Supplier at Items.Supplier

However, per Envision’s usual conventions, the Items table name is omitted.

In this specific example, we have populated a vector Items.Moq with the data obtained from the table Suppliers. A join has been made between the two fields Items.Supplier and Suppliers.Supplier.

In the case that a supplier entry found in the Items table does not have a counterpart in the Suppliers table, a zero value will be used. However, through the optional default keyword, following the usual behavior of the Envision aggregators, other default values could be used. We could have written:

Items.Moq = same(Suppliers.Moq) by Suppliers.Supplier at Supplier default 1

On lines 11 to 16, the resulting Items table is displayed in order to illustrate the calculation performed by the by-at the line above. As expected, each item is associated with the MOQ value of its respective supplier.

General syntax for the by-at

The general syntax for the by-at is:

Items.Moq = same(Suppliers.Moq) by Suppliers.Supplier at Supplier default 1

Much like the usual aggregators in Envision, the by-at offers the possibility to perform a tuple match, that is, matching n fields at a time. Then, all the usual Envision aggregators can be used: sum, but also min, median, same, etc. The default block can be omitted just like it is done with a regular aggregation.

The semantic of the by-at is as follows:

Just like regular aggregators, the default statement is optional, and provides the value to be used as a result of the aggregation if the group is empty.

Tips for the by-at

The by-at is a powerful construct that can be leveraged in many situations, not just to join tables from the usual SQL perspective.

A table can be joined to itself: for example,

Orders.DaySum = sum(Orders.Quantity) by Orders.Date at Orders.Date

illustrates how to compute daily totals over the Orders table without resorting to the Day table.

Calendar tables Day and Week can be joined to: for example,

Day.Shift = sum(Orders.Quantity) by [Orders.Id, Orders.Date - 1] at [Day.Id, Day.Date]

illustrates how to shift the quantities from one day through a by-at.

Translating the by-at into SQL

For the readers who are already familiar with SQL, then it can be noted that the Envision expression:

Moq = same(Suppliers.Moq) by Suppliers.Supplier at Supplier or defaultMoq

has the following equivalent in SQL, which uses an outer left join: UPDATE Items LEFT OUTER JOIN Suppliers ON Items.Supplier = Suppliers.Supplier SET Moq = COALESCE(Suppliers.Moq, defaultMoq) The Envision syntax emphasizes Excel-like calculations more than the relational algebra itself.

Cross Joins

The cross join is another classic relational operation supported by Envision. A cross join operates over a Cartesian product of the lines involved in the join. As a cross join can easily generate a large number of lines to be processed, we suggest to pay close attention to the size of the joined tables in order to keep the processing time under control. The syntax for cross join is the following:

T1.R = sum(T1.A * T2.B) cross (T1, T2) if (T1.C and T2.D) or T1.Default

This statement creates a temporary anonymous table that is the cross join of tables T1 and T2. The aggregator argument, as well as the if filter, are evaluated in this table (and therefore, forcibly inflated from tables T1 and T2). The result of the aggregation is computed into table T1 by aggregating together the filtered lines of the cross join that correspond to each line of T1.

If there is already a relationship between T1 and T2 - such as Items and Orders where the latter can be inflated into the former - or if both tables are the same, then the alias-based variant can be used:

T1.R = sum(T1.A * TAlias.B) cross (T1, T2 as TAlias) if (T1.C and TAlias.B) or T1.Default

The alias table TAlias contains the same variables as T2, but has no relationship with any table (except the cross table itself). The alias table is not available outside the aggregator argument and if filter. An alias name may be used more than once in a given script (in separate aggregators), but must not conflict with input tables, built-in tables, or tables created with a table T = ... statement (as illustrated above).

The size of the cross-join is currently limited to 4e9 (four billion) lines. If crossing a table with itself, this means the table may be no larger than about 60k lines. In practice, Envision will feel noticeably slow much before actual limit at 4 billion lines is reached. We suggest to make sure that joined table are properly filtered before the cross join in order to keep the complexity in control.