List of aggregators
The aggregators are similar to functions, but they benefit from a richer syntax with options. Below, we list the aggregators supported by Envision.
There are two main usages for aggregators, whether a join between tables is performed or not.
When there is no join, the general usage for an aggregator
u = agg(v) by [w1, w2, w3] sort [w4, w5] if c or x over [a .. b]
aggis the name of the aggregator.
byis an option indicating the groups to be used when aggregating. It is possible to aggregate against one or more values.
sortis an option indicating the sorting order to be used when aggregating. It is possible to sort against one or more values.
ifis an aggregator filter. When present, all lines that are
falseare omitted from the aggregation.
oris an option indicating the value to return if the group is empty.
overis an option used to define an aggregation time-span, with boundaries relative to the date of the target line.
by is not specified, the grouping happens according to the context, typically using the left-side of the statement in the case of an assignment. It is possible to list multiple variables after
by; if only one variable is used, then the parentheses are not required.
sort option has typically no impact on most aggregators, e.g.
max, because those operators do not depend on any sort order in the first place. However, we suggest to leverage this option with
if is not specified, all lines are included that are equivalent to
or is not specified, default values are returned when the grouping is empty:
0 (zero) for number,
false for Booleans, the empty string for text and the zero distribution for distributions.
over option is available when aggregating data into a calendar table (see also Calendar tables). The boundaries of the time interval are expressed in days, weeks or months and they are inclusive. For example, considering a table Day, the option
over [-5 .. 0] will perform an aggregation over 6 days, namely the day of the target line and the 5 previous days.
Aggregation with table joins
The general syntax for this second case is similar but different to the one described above:
u = agg(v) by [w1, w2, w3] sort [w4, w5] if c at [z1, z2, z3] or x
at option indicates the groups used when joining the tables. Note that there must be a one-to-one correspondence between these groups and the groups used in the
by option. As a consequence, the
by option is mandatory and the number of entries specified in the
by and in the
at options must be identical.
The most commonly used aggregators are:
avg: numeric argument only, returns the average line value.
count: counts the number of lines that are either
true(Booleans) or non-zero (numbers) or non-empty (text).
distinct: counts the number of distinct values.
first: returns the value associated to the earliest line, as defined per the
last: returns the value associated to the latest line, as defined per the
max: returns the highest value.
median: behaves like the MEDIAN function in Excel.
min: returns the lowest value.
product: numeric argument only, returns the product of line values.
ratio: returns the percentage (between
1) of Booleans that are true, or of numbers that are non-zero. The ratio of an empty group is
same: expects all the grouped values to be identical (fails otherwise) and returns this value.
sum: numeric argument only, returns the sum of line values.
The aggregators below returns a Boolean value:
trueif all the input lines are
falseif all the input lines are
trueif there is at least one line passed to the aggregator.
These aggregators apply only to tables that benefit from a
latest(cross-table, special case): returns the most recent value observed.
latestis not compatible with any of the available options.
The aggregators below provide statistical calculations:
entropy: returns the Shannon Entropy of the group. The value is returned expressed in shannons.
mode: similar to the MODE function in Excel. In the case where more than one value have the highest number of occurrences, the aggregator returns the smallest of these values (different from Excel). If the input values are Booleans, “false” is taken as the smallest value.
norm: called a unity-based normalization, feature scaling, or rescaling (min-max normalization), computes the min and max values of each group, then rescales the values into a range of
[0 .. 1](where min becomes
0and max becomes
1). If min = max, all values become
percentile: similar to the PERCENTILE.INC function in Excel. It requires two arguments: a vector of data and a number representing the percentile target (included between
stdev: similar to the STDEV function in Excel.
stdevp: similar to the STDEV.P function in Excel.
zscore: similar to the STANDARDIZE function in Excel. The z-score of a value indicates how many standard deviations from the mean your score is. If the standard deviation is equal to zero, so is the z-score.
Distributions can be generated through aggregation (see also algebra of distributions):
ranvar(v): returns the empirical distribution of the associated random variable, using the numeric observations provided in the input vector.
ranvar(v1, v2): two-argument aggregator that returns the empirical distribution of the associated random variable, using the numeric observations provided in the input vector
v1along with the weights contained in vector
sumr: returns the sum of additive convolutions of random variables. Just like
sum()is the aggregator associated to the
+operator, the aggregator
sumr()is associated to the