# Aggregating data

Data aggregation is the process of combining multiple table lines through special functions called aggregators. Whenever sums, averages, counts or medians are involved, the calculations are carried out through aggregators. Aggregation also provides a way of combining data originating from different tables. Envision provides a rich syntax to support all such scenarios. In this section, we detail and illustrate how aggregators work with Envision.

## An illustrating script

In this section we use, once again, the sample dataset which should be accessible from the /sample path in your Lokad account. The script below is moderately complex and illustrates some of the data aggregation patterns available in Envision. Data aggregation can take place both outside the tiles and inside the tiles. We advise you to read Making calculations with Envision first, as it should greatly help you with understanding the contents of this section.

read "/sample/Lokad_Items.tsv" as Items with
Id : text
Category : text
Brand : text
Supplier : text

Id : text
Date : date
Quantity : number
NetAmount : number

Id : text
Date : date
NetAmount : number

show label "Aggregating data" a1f1 tomato

oend = max(Orders.Date)
obegin = oend - 365

totalPurchased = sum(PO.NetAmount)
totalSold = sum(Orders.NetAmount) when date >= obegin
show scalar "Total purchased" a2c2 with
totalPurchased unit:"$" show scalar "Sold over 1 year" d2f2 with totalSold unit:"$"

Items.VolumeSold = sum(Orders.NetAmount)
Items.UnitSold = median(Orders.Quantity)
show table "Top sellers" a3f4 tomato with
Items.Name
Items.VolumeSold as "Sold" unit:"$" Items.UnitSold as "Median" order by Items.VolumeSold desc show table "Top suppliers" a5f6 tomato with Items.Supplier distinct(Items.Category) as "Categories" sum(Orders.NetAmount) as "Sold" unit:"$"
mode(Items.Brand into Orders) if (Items.Brand != "Flash") as "Most frequent brand sold"
group by Items.Supplier
order by sum(Orders.NetAmount) desc


We suggest to begin by copy-pasting the above script into your Lokad account, and running it once in order to observe the resulting dashboard. If everything works, you should see the dashboard illustrated below.

## Scalars for single-value aggregations

A variable that has no “item” affinity in Envision is referred to as a scalar variable (the variable is not attached to any particular line of the Items table). With the Excel analogy in mind, Envision variables are akin to Excel columns by default: these variables are vectors and they hold many values at once (one value per item actually). However, it is also possible to have variables that behave like a single Excel cell, and that only hold a single value: these are the scalar variables. The script above illustrates how two scalar variables can be computed; the relevant lines are copied below for the sake of clarity.

totalPurchased = sum(PO.NetAmount)
totalSold = sum(Orders.NetAmount) when date >= obegin
show scalar "Total purchased" a2c2 with
totalPurchased unit:"$" show scalar "Sold over 1 year" d2f2 with totalSold unit:"$"


Several aspects can be highlighted in these few lines of code. Firstly, the two assignments on lines 1 and 2 are carried out using = toward the implicit Scalar table. Secondly, the variables totalPurchased and tableSold begin with lowercased names. While variable names are not case-sensitive in Envision, the spelling of these two variables is not an accident in this case. As a coding practice, we recommend to lowercase the first letter of the variables that contain scalar values. In contrast, the names of the tables and the names of the columns should start with an uppercase letter. Following these guidelines will make your script more readable and easier to debug. Thirdly, the scalar aggregation takes place by default within the show table statement if the table has only a single column. For example, lines 1 and 3-4 could be combined into a single show table statement illustrated just below.

show scalar "Total purchased" a2c2 with
sum(PO.NetAmount) unit:""  Numeric scalar variables can also be used to do arbitrary calculations just like any numeric variable in Envision. For example - while it may not make much sense from a business perspective - the following calculation could be added at the end of the Envision script, once the variables totalPurchased and totalSold have been defined: mySum = (totalSold - totalPurchased) / 2 show scalar "My Sum" with mySum  ## Multi-value aggregations Envision also supports the possibility to perform multi-value aggregations, typically aggregating the data of a given table and projecting the result on the lines of another table. Thus, aggregation is the most frequent approach to combine data coming from different tables. In particular, one of the most frequent use cases consists of taking an historical table, that is, a table also indexed with a Date column, and aggregating this table into a vector aligned with the Items table. The script shown previously illustrates this pattern with a table that displays the top sellers, that is, the items ordered according to their respective sales volume, putting the highest values at the top. The relevant lines of script are copied below for the sake of clarity. Items.VolumeSold = sum(Orders.NetAmount) Items.UnitSold = median(Orders.Quantity) show table "Top sellers" a3f4 tomato with Name Items.VolumeSold as "Sold" unit:""
Items.UnitSold as "Median"
order by Items.VolumeSold desc


Lines 1 and 2 are performing aggregations in a way that is very similar to the scalar aggregations that we have reviewed in the previous section. However both VolumeSold and UnitSold are vectors which are part of the Items table. Thus, these variables are similar to columns in Excel, and they contain one value per item.

Both sum and median are special functions named aggregators in Envision. There are many other aggregators in Envision, such as avg, min, max, first, last. For the sake of concision, we are not reviewing all the aggregators available in Envision in this section, however, for more details, you can check out the full list of aggregators.

The result of these two aggregations is displayed in the table defined with the lines 3 to 7. The two vectors VolumeSold and UnitSold are listed as arguments after the keyword with, and their values are displayed in the table. Finally on line 7, the order by statement indicates that the table should be sorted starting with the items that have the highest VolumeSold values.

Readers who may be familiar with the GROUP BY syntax in SQL might be wondering how Envision knows which grouping should be used when computing the sum at line 1. By default, Envision performs a grouping using the columns which act as “keys” on the right side of the assignment. In the case of a variable that belongs to the Items table – the table whose name is implicit – the column acting as the (primary) key is the Id column. This explains why a per-item aggregation results from using the = sign.

## Explicit aggregation groups with by

So far, the aggregations we have performed relied on the implicit aggregation patterns of Envision. However, the behavior of all the aggregators can be modified with an optional keyword by that is used to explicitly specify the applicable grouping. Let’s illustrate how this keyword by is used:

Items.VolumeSold = sum(Orders.NetAmount)
//equivalent to
Items.VolumeSold = sum(Orders.NetAmount) by Id


Line 2 is assigned to a second vector named VolumeSold but the values of this vector are strictly identical to the one of the vector VolumeSold at line 1. In fact, the option by Id is also used, albeit implicitly, at line 1. Intuitively, when the option by is used, it is as if groups were created first according to the grouping target, and second, as if the aggregator were separately computed for every group. The by option offers the possibility to compose fairly complex aggregations, as illustrated by the script at the beginning of this section. Let’s review the two lines of script where the aggregations are performed through by options.

avgRet = avg(distinct(Orders.Date) by Orders.Client)
avgQty = avg(sum(Orders.Quantity) by [Orders.Client, Orders.Date])


The aggregator distinct counts the number of distinct values observed within each group. At line 1, the lines of the Orders table are first grouped according to their respective Client values; and then for every client, the number of distinct ordering dates is counted. Intuitively, this aggregation can be interpreted as counting the number of times each client has returned. Then, this result is re-aggregated into a single scalar value with the avg aggregator that contains the inner distinct aggregation.

The scalar avgQty can be interpreted as the number of units purchased per basket. The calculation begins with a sum() by statement, however after the by option, we have not one but two variables separated by commas and listed within brackets: [Orders.Client, Orders.Date]. This syntax should be understood as follows: create a group for every pair of Client and Date. From a business perspective, we are treating all units purchased the same day as the same basket, which is a reasonable approximation for most situations. Finally, the external call to avg produces the final average over all the sum aggregates computed for all the pairs.

More generally, the by option supports an arbitrary number of variables following the syntax sum() by [arg1, arg2, …, argN]. In practice, however, it is rather rare to encounter situations where it makes sense to group by more than 4 variables at once. Also, the order of the arguments has no impact on the resulting groups used to compute the aggregates.

### Explicit aggregation groups with into

The into keyword followed by a table name allows to perform an aggregation analogous to by followed by a column name. The grouping is then done according to the table’s primary key. Let’s illustrate this with the previous example:

Items.SameVolumeSold = sum(Orders.NetAmount) into Items
//equivalent to
Items.SameVolumeSold = sum(Orders.NetAmount) by Orders.Id at Items.Id
//equivalent to
Items.SameVolumeSold = sum(Orders.NetAmount)


Here we have replaced by Id (rather by Orders.Id at Items.Id) with into Items: they are equivalent since the primary key of Items is Id. Similarly, into Day is equivalent to by Date and into Scalar to by 1.

As a rule of thumb, if the target table of the aggregation is known, we recommend to use the into TableName syntax, the advantage being that no new indices are created for the conversion from the argument table to the target one.

The into keyword offers also a very compact syntax to import scalars (numbers, Booleans…) into a table, without explicitly creating a column. For example:

nb = distinct(Seasons.Season)
table NewTable = extend.range(nb into Items)
//equivalent to
Number = distinct(Seasons.Season) by 1
table NewTable = extend.range(Number)


## Explicit aggregation within a table with group by

Sometimes, through aggregation, we can produce a new table which is more relevant to be used within a dashboard than the original non-aggregated table. Thus, Envision also supports the possibility of aggregating data directly from the tile declaration statement. The most direct way of visualizing this Envision capability is to aggregate the data which is to be displayed in a table. This is precisely what is being done in the script at the top of this page. Let’s have a second look below at the relevant lines copied below.

show table "Top Suppliers" with
Items.Supplier
distinct(Items.Category) as "Categories"
sum(Orders.NetAmount) as "Sold" unit:"$" mode(Items.Brand into Orders) if (Items.Brand != "Fellowes") as "Most frequent brand sold" group by Items.Supplier order by sum(Orders.NetAmount) desc  The show table statement goes from line 2 to line 8, and more specifically, the aggregation is specified on line 7 with the statement group by, which is exactly the same semantic as the by option that we have reviewed in the previous section. At this point, you might be wondering why Envision is not using the keyword by instead of group by if the semantic is the same. The if at line 6 is an aggregator filter, discussed in the next section. The answer is simple: it is possible to use the by option within the list of expressions used after the with for a given tile (such a situation is not illustrated here however). Thus group by allows to differentiate between a by statement that is part of an expression passed to the tile, and a group by statement that applies to the tile as a whole. In other words, the group by applies to all the expressions listed after the with keyword, in contrast to the by statement that only has a local impact. When group by is used, all expressions passed to the tile after the with keyword should offer the possibility of being aggregated. For example, the 1-line script below is incorrect because Name does not offer an aggregation pattern while group by Supplier is specified. show table "WRONG!" with Name group by Items.Supplier  However, if we modify this script by introducing an aggregator, distinct for example, then our Envision script becomes valid. show scalar "CORRECT!" with distinct(Items.Name) group by Items.Supplier  The only exception to this rule is the aggregation of the target itself. In the script at the beginning of this section, we have a group by Supplier statement on line 7. On line 3, the variable Supplier is listed without any aggregator, yet the script remains valid, precisely because the grouping takes place according to the Supplier variable. The need to expose an aggregator also applies to the order by statement on line 8. Indeed, the table is first aggregated by supplier, and only then is it sorted according to the Id column – unless another sort order is specified. As a result, Envision needs to compute one value per group in order to sort all these groups. This is exactly what we see happen with the statement group by sum(Orders.NetAmount). While we illustrated group by with the table tile, this syntax is not specific to the table tile, and the same pattern can be used with most other tiles. For example, it would be possible to extend the script listed at the very top of this page with a barchart tile aggregated by Brand: show barchart "Sales by brand" with sum(Orders.NetAmount) as "Net Amount" group by Items.Brand  The group by also supports multi-grouping, that is grouping by multiple vectors or multiple expressions, with a syntax that is identical to the one of the by option that we have covered previously. ### Explicit aggregation within a table with group into As into is analogous to by, a group into statement is supported as well, in analogy with group by. ## Explicit aggregation between two tables with cross Data aggregation can be performed across two tables thanks to the cross keyword. Let us consider the following example: Items.TotalNetAmountLastMonth = sum(SellPrice * Orders.Quantity) cross (Items, O) if (Orders.Date >= max - 30)  The cross join establishes a connection between each line of the first table and all the lines of the second table. An anonymous temporary table is thus created, where the aggregation is computed, taking the specified filters into account. However, the argument of the aggregation might be evaluated before the cross join is applied, if the two tables are already related (ex. one table with itself, or Items and Orders tables through their primary key Id). In such case, the two tables can be made independent of each other by using an alias: Items.TotalNetAmountLastMonth = sum(Items.SellPrice * T.Quantity) cross (Items, O as T) if (T.Date >= max - 30)  The size of the cross-join table is currently limited to 4 billions lines. Thus, if crossing a table with itself, the table must be no larger than about 60k lines. ## Aggregation filters The Envision aggregators also support filters through the if keyword. The script below illustrates the use of such a filter at line 6: show table "Top Suppliers" with Items.Supplier distinct(Items.Category) as "Categories" sum(Orders.NetAmount) as "Sold" unit:"$"
mode(Items.Brand into Ordrs) if (Items.Brand != "Fellowes") as "Most frequent brand sold"
group by Items.Supplier
order by sum(Orders.NetAmount) desc


Here the brand Fellowes is explicitly excluded from the report through the if filter. Note that Brand is explicitly projected from the Items table into Orders table in the mode aggregator and implicitly in the if filter. Filters are useful within tile statement, because they offer the possibility to filter each column separately. In contrast, a where filter positioned outside the tile statement would have filtered all the selected lines of the Orders table indifferently for all columns.

Filters can also be used inline:

Items.TwoAndMore = sum(Orders.1) if (Orders.Quantity >= 2)
//equivalent to
Items.TwoAndMore = count(Orders.Quantity >= 2)


### Using indexers to access chosen lines of a table

In the case of tables with explicitly defined keys (i.e. not including a *, as in T[Id]), Envision supports the seek functionality: specific lines of the table can be directly accessed by selecting the wished value of the indexers. The syntax is the following:

ref = "16395266"
show scalar "SellPrice of \{ref}" with SellPrice[ref]
//equivalent to
where Id == ref
show scalar "SellPrice of \{ref}" with same(Items.SellPrice)
//equivalent to
show scalar "SellPrice of \{ref}" with same(Items.SellPrice) if (Id == ref)


As an additional example, the seek functionality applies to calendar tables:

Day.Items.WeeklyMovingAverage = sum(Orders.Qty) over [-6 .. 0] // sum of the sales by item and day, over the last 7 days
todayDate = today(1)
show table "Sales Comparison" with
Id
Day.Items.WeeklyMovingAverage[Id, todayDate] as "Last week sales"
Day.Items.WeeklyMovingAverage[Id, todayDate - 30] as "One-month-ago week sales"