User defined functions

User defined functions are, as the name suggests, functions that are defined within an Envision script rather than being part of the standard library of functions. In this section, for the sake of brevity, “functions” refers to “user defined functions”.

Purpose

Functions allow you to factorize logic, that is, to avoid writing the same logic in many different places by replacing those occurrences with calls to the same function.

Functions offer loop-like behaviors and represent a class of alternatives to, well, actual loops. Unlike loops, functions - as introduced by Envision - offer better guarantees for both correctness and performance.

There are two flavors of functions, respectively identified by the keywords pure and process. Processes (i.e. process functions) are more complex than pure functions. Also, processes are intended to be instrumented through the by, at, sort and scan options that we have previously covered while introducing the aggregators. It turns out that aggregators are a specialized class of processes.

Neither pure functions nor processes have any side effects on their arguments or options. In this regard, Envision is a strictly functional language (as in mathematical functions), and the same goes for functions, even those that are defined by users.

Processes, however, do maintain an internal state (or simply state) that changes through their lifecycle. Intuitively, the state represents the information that is maintained while the computation is in progress. For example, for a sum, the partial sum associated with the numbers added so far is the state.

Advanced remarks: The way Envision approaches functions is probably somewhat surprising to most experienced programmers. In mainstream programming languages, compute and memory usage are (largely) delegated to the programmer. Depending on the language, more or less tooling is baked into the language to ease this management (e.g. a garbage collector), but the programmer is nonetheless expected to be the one in charge of the computational resources. Envision approaches the problem from a different perspective: functions that compile automatically benefit from desirable properties with regard to computational resources. Supply chain scientists are not expected to be intimately familiar with those properties. The underlying platform ensures that the amount of resources consumed remains roughly proportional to the amount of data, while leveraging data parallelism to the greatest extent to keep wall-clock time under control even when processing terabytes of data.

Pure functions

Functions take a series of arguments, do some processing and return a series of values. All functions are declared with the keyword def. Pure functions, specified with the keyword pure, are the simplest kind of function. Let’s illustrate how a simple hello() function that prefixes Hello to a text value passed as an argument can be defined:

def pure hello(a : text) with
  return "Hello \{a}"

greeting = hello("World!")

show label greeting

Which, unsurprisingly, displays Hello World!.

The first line of the above script contains the declaration of the hello() function. This declaration starts with the keyword def and ends with the keyword with. The keyword pure indicates the type of the function. The argument is named a, and its data type is specified through the colon (:) symbol. As usual, with opens a new block, and thus the next line comes with an extra level of indentation. Then, the second and last line of the function definition uses the keyword return to specify the value returned by the function. The function hello() is then called to define the scalar value greeting, and this value is finally displayed through a label tile at the last line of the script.

All function declarations require the sequence of keywords def, with and return. Also, the last statement of the function block must be a return statement. The type of the function can be either pure or process; we will return to the latter below. Finally, a function must be declared before being called. The call syntax is identical to the one used to call functions from the standard library.

Functions (processes included) come with table-free (i.e. scalar) arguments and return a tuple of table-free values as well. The compiler does not allow the use of a table prefix within the function declaration line or within the function body. This design differs from regular Envision expressions that always have an affinity to a specific table, even if it’s only the Scalar table. While the terminology can be a bit confusing, when we say that functions are scalar, we indicate they operate over values instead of vectors.

The scalar nature of functions offers the possibility to automatically vectorize them, i.e. let them take a vector as input and return a vector as output as illustrated by:

table Audience = with
  [|as Folks |]
  [| "Ladies" |]
  [| "Gentlemen" |]
  [| "Mr. President" |]

def pure hello(a : text) with
  return "Hello \{a}"

show table "Greetings" with
  hello(Audience.Folks) as "Hello!"

Which displays the following table:

Hello!
Hello Ladies
Hello Gentlemen
Hello Mr. President

In the above script, the function hello() is being implicitly called three times, once for every line of the Audience table.

Under the hood, the Lokad platform not only vectorizes the hello() function, but can also distribute the computation over multiple CPUs and even multiple machines if the workload justifies such a high degree of parallelism (i.e. large vectors passed as arguments).

Arguments and return values

Functions can have multiple comma-separated arguments as illustrated by:

def pure myProduct(a : number, b : number) with
  p = a * b
  return p

r = myProduct(6, 7)

show label "\{r}"

In the above script, the variable p is introduced as a local variable.
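
Like hello(), a function with several arguments can be applied to vectors. The following is a minimal sketch, assuming a hypothetical Lines table (with its Qty and Price vectors) introduced here for illustration only; myProduct() is implicitly called once per line with the matching pair of values:

```envision
table Lines = with
  [| as Qty, as Price |]
  [| 2, 10 |]
  [| 3, 5 |]

def pure myProduct(a : number, b : number) with
  p = a * b
  return p

show table "Amounts" with
  Lines.Qty
  Lines.Price
  myProduct(Lines.Qty, Lines.Price) as "Amount"
```

Under these assumptions, the Amount column should contain 20 and 15.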

Functions can also return multiple values through tuples as illustrated by:

def pure euclidianDiv(a : number, b : number) with
  quotient = floor(a / b)
  remainder = a - quotient * b
  return (quotient, remainder)

q, r = euclidianDiv(43, 6)

show label "\{q}, \{r}"

The syntax to deconstruct the returned tuples is identical to the one introduced previously for with blocks (cf. the previous section “Scoping and tuples”).

While defining a function, it is possible to call other functions as long as they are either part of the standard library (as illustrated by the call to floor() above) or have been defined earlier in the script. However, while processes can call both pure functions and processes, pure functions can only call pure functions. We will revisit this point in greater detail below.

Branching

A branch refers to an instruction that tells the computer to execute a different part of a program rather than executing statements one by one. Branches are supported by functions through the keywords if .. else if .. else ... Let’s consider a function that provides a visual indicator that a measurement is growing with:

def pure isGrowing(oldValue : number, newValue : number) with
  r = ""
  if oldValue < newValue
    r = "yes"
  else if abs(oldValue - newValue) < 0.01
    r = "maybe"
  else
    r = "no"
  return r

In the above script, the if keyword is used to branch the flow of execution within the function. The variable r is defined with the empty text value. If the expression oldValue < newValue evaluates to true, then r is overwritten with the text value yes. If not, another test is made: if oldValue is close to newValue, then r is overwritten with the text value maybe. Finally, if both previous conditions are false, then r is overwritten with the text value no.

The three statements if, else if and else introduce blocks; hence, the line that follows any of those statements must use an extra level of indentation. Also, the else if and else statements are both optional. For example, the script above can be rewritten in a slightly more concise form by taking advantage of this:

def pure isGrowing(oldValue : number, newValue : number) with
  r = "no"
  if oldValue < newValue
    r = "yes"
  else if abs(oldValue - newValue) < 0.01
    r = "maybe"
  return r

The above script is logically strictly equivalent to the previous one. However, the else statement is omitted as the r variable is already initially defined with the text value no.

Along with branching, early termination is also supported. Functions can be declared with multiple return statements. The following script illustrates how the previous example can be further simplified:

def pure isGrowing(oldValue : number, newValue : number) with
  if oldValue < newValue
    return "yes"
  else if abs(oldValue - newValue) < 0.01
    return "maybe"
  return "no"

The above script is logically equivalent to the two previous ones. However, we entirely omit the declaration of the local variable r, and leverage three distinct return statements to achieve the same behavior.

It is also possible to omit the final return statement if all branches do return:

def pure isGrowing(oldValue : number, newValue : number) with
  if oldValue < newValue
    return "yes"
  else if abs(oldValue - newValue) < 0.01
    return "maybe"
  else
    return "no"
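
Branching functions are vectorized like any other function. The sketch below assumes a hypothetical Measures table, introduced for illustration only, whose three lines are chosen to exercise each of the three branches of isGrowing():

```envision
table Measures = with
  [| as OldValue, as NewValue |]
  [| 10, 12 |]
  [| 10.005, 10 |]
  [| 10, 8 |]

def pure isGrowing(oldValue : number, newValue : number) with
  if oldValue < newValue
    return "yes"
  else if abs(oldValue - newValue) < 0.01
    return "maybe"
  else
    return "no"

show table "Trend" with
  Measures.OldValue
  Measures.NewValue
  isGrowing(Measures.OldValue, Measures.NewValue) as "Growing"
```

Under these assumptions, the Growing column should contain yes, maybe and no, in that order.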

When multiple return statements are found in the declaration of a function, they must all return the same data type. For example, the following function attempts to return both a number and a text, which does not compile:

def pure myAbs(a : number) with
  if a > 0
    return a
  return "negative" // WRONG! Incompatible returned type Number against Text

Overloading

It is possible to declare multiple functions that have the same name, a mechanism known as overloading, as long as they don’t have the same argument types. For example, the following script declares two myAdd() functions:

def pure myAdd(a : number, b : number) with
  return a + b

def pure myAdd(a : text, b : text) with
  return "\{a}\{b}"

It is not allowed to overload the names of the standard library functions.
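
As a brief sketch of how an overload gets selected, the two calls below differ only by the types of their arguments; the variable names n and t are illustrative:

```envision
def pure myAdd(a : number, b : number) with
  return a + b

def pure myAdd(a : text, b : text) with
  return "\{a}\{b}"

n = myAdd(1, 2)         // resolved to the number overload
t = myAdd("foo", "bar") // resolved to the text overload

show label "\{n} \{t}"
```

The label is expected to display 3 foobar.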

Context capture

The code of a function cannot access any variables from the script that defines it, with the exception of constants defined above its definition. The following script illustrates this mechanism:

const a = 42

def pure myAdd(x : number) with
  return x + a

show scalar "" with myAdd(13) // 55

In the above script, the variable a is first flagged as const (marking it as a compile-time constant). It is then used in the declaration of the function myAdd. We say that the variable a is captured by the function myAdd.
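
Conversely, following the rule stated above, a variable that is not flagged as const should be rejected when used inside a function. A sketch of the failing case, where the comment paraphrases the problem and is not the actual compiler message:

```envision
a = 42 // not flagged as const

def pure myAdd(x : number) with
  return x + a // WRONG! 'a' is not a constant, it cannot be captured
```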

A function can call any function defined above itself:

def pure myAdd(x : number, y : number) with
  return x + y

def pure timesTwo(x : number) with
  return myAdd(x, x)

Limitations

While it is authorized to have a function calling either a previously defined function or a function of the standard library, functions cannot call themselves recursively.

def pure factorial(a : number) with
  if a <= 1
    return 1
  return a * factorial(a - 1) // WRONG! recursive call not allowed

A pure function may not call a pseudo-random number generator (such as random.poisson) unless it is flagged as random.

Processes

While functions are scalar in design, they are intended to be ultimately used on vectors. This usage pattern implies that while a single function call is written (in the script), the underlying function actually gets called multiple times, once per table line. So far, with pure functions, those calls were kept strictly decoupled. The whole point of processes is to introduce coupling between the calls in order to perform calculations that would have otherwise been impossible under the strict decoupling rule. Let’s revisit our greeting examples with:

table Audience = with
  [|as Folks |]
  [| "Ladies" |]
  [| "Gentlemen" |]
  [| "Mr. President" |]

def process hello(a : text) with
  keep call = 0
  call = call + 1
  return ("Hello \{a}", call)

Audience.Hello, Audience.Call =
  hello(Audience.Folks) scan Audience.Folks

show table "Greetings" with
  Audience.Hello
  Audience.Call

Which displays the following table:

Hello Call
Hello Ladies 2
Hello Gentlemen 1
Hello Mr. President 3

In the above script, hello() is declared as a process function - or simply process - through the use of the process keyword (which replaces the pure keyword that we have used so far). Then, the variable call is introduced and initialized with the 0 value by the keyword keep, which indicates that this variable is part of the state of the process. This state is preserved from one execution of the process to the next. At each execution, the variable call is incremented through the line call = call + 1. Finally, at each execution, a tuple containing both the greeting and the call index is returned.

The scan option, as detailed earlier in the “Aggregating” section, entails a sorting behavior. Thus, the Audience table gets sorted against Audience.Folks and then, the process hello() is repeatedly executed for each line of the table.

A process maintains an internal scalar state that is preserved across the calls. A process is a function, but not a pure function, because a side effect is involved: the state of the process gets modified by each successive call. Nevertheless, there are no side effects as far as arguments or options are concerned.

A process’s calls involve (call) options such as by, at, sort, scan that we already introduced in the “Aggregating” section. These options specify the fine print of the sequence of calls (executions) to be used for the process. In fact, most of the aggregators can almost be re-implemented with processes as illustrated by mySum() and myAvg() in:

table Numbers = with
  [| as N |]
  [| 3 |]
  [| 4 |]
  [| 8 |]

def process mySum(a : number) with
  keep sum = 0
  sum = sum + a
  return sum

def process myAvg(a : number) with
  return mySum(a) / mySum(1)

sum = mySum(Numbers.N) sort Numbers.N
avg = myAvg(Numbers.N) sort Numbers.N

show summary "" with sum, avg

Unlike some aggregators, all processes need a sort order to be specified, either via the keyword sort or via the keyword scan. Indeed, the Envision compiler is not capable of automatically detecting when ordering happens to be inconsequential, and thus, ordering is enforced to eliminate potential ambiguities. However, the auto keyword can be used to specify the use of the canonical order of the table itself:

def process mySum(x : number) with
  keep s = 0
  s = s + x
  return s

table T = extend.range(10)
T.S = mySum(T.N) scan auto
show table "" a1b10 with T.N, T.S

Process lifecycle

The lifecycle of a process starts with its initialization, continues with a series of updates, each of which includes the emission of a result, and ends with its reset. In order to walk through this lifecycle, let’s revisit a variant of the script introduced in the previous section with:

table Audience = with
  [|as Folks, as Kind |]
  [| "Ladies", "cohort" |]
  [| "Gentlemen", "cohort" |]
  [| "Mr. President", "person" |]
  [| "Ms. President", "person" |]

def process hello(a : text) with
  keep count = 0
  count = count + 1
  return ("Hello \{a}", count)

Audience.Hello, Audience.Count = hello(Audience.Folks)
                                 by Audience.Kind
                                 scan Audience.Folks

show table "Greetings" with
  Audience.Hello
  Audience.Count

Which displays the following table:

Hello Count
Hello Ladies 2
Hello Gentlemen 1
Hello Mr. President 1
Hello Ms. President 2

In the above script, we add a second vector Audience.Kind to the Audience table. Then, when calling the hello() process, we add the by Audience.Kind option, which wasn’t present in the previous script. As a result, we observe in the displayed table that the count restarts for each group: calls are counted once for the cohort group and once, separately, for the person group.

Under the hood, the hello() process goes through its lifecycle once per group. The groups identified through the by option specify when the process must be initialized and reset. When scanning, a result is emitted for each line processed. However, when simply aggregating, the result is only emitted for the last line of the group.
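
To contrast scanning with aggregating, the sketch below calls the same process twice over the same groups. The Sales and Products tables, along with their vectors, are hypothetical and introduced for illustration only:

```envision
table Sales = with
  [| as Product, as Qty |]
  [| "pants", 3 |]
  [| "pants", 4 |]
  [| "socks", 8 |]

table Products[Product] = by Sales.Product

def process mySum(a : number) with
  keep s = 0
  s = s + a
  return s

// scan: one result per line, the state being reset for every group
Sales.Running = mySum(Sales.Qty) by Sales.Product scan Sales.Qty

// sort: one result per group, emitted at the last line of the group
Products.Total = mySum(Sales.Qty) sort Sales.Qty // 'by Products' is implicit

show table "Running" with
  Sales.Product
  Sales.Qty
  Sales.Running

show table "Totals" with
  Product
  Products.Total
```

Under these assumptions, Sales.Running should hold 3 and 7 for pants and 8 for socks, while Products.Total should hold 7 for pants and 8 for socks.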

As pure functions don’t have a group-level lifecycle they can’t internally call a process. However, the opposite works: a process can internally call pure functions, as well as processes.

Group arguments

Group arguments exist for processes. We have already encountered this kind of argument in the section “Aggregating”, where we saw that group arguments are introduced by the semicolon (;) delimiter. The group arguments are aligned (table-wise), as the name suggests, with the underlying group table. Let’s illustrate this mechanism with a variant of the script introduced in the subsection “Aggregating, Group arguments”:

table Variants = with
  [| as Product, as Color, as Limit  |]
  [| "pants", "blue", " " |]
  [| "shirt", "pink", ", " |]
  [| "shirt", "white", ", " |]
  [| "socks", "green", " - " |]
  [| "socks", "yellow", " - " |]

table Products[Product] = by Variants.Product
Products.Limit = same(Variants.Limit)

def process myJoin(a : text; limit : text) with
  keep c = ""
  if c == ""
    c = a
  else
    c = "\{c}\{limit}\{a}"
  return c

Products.Colors = myJoin(Variants.Color; Products.Limit)
                  sort Variants.Color // 'by Products' is implicit

show table "Colors" with
  Product
  Products.Colors

Which also displays the following table:

Product Colors
pants blue
shirt pink, white
socks green - yellow

The difference between the above script and its original version from the “Aggregating” section is the use of the myJoin() process, which re-implements the join() aggregator found in the standard library. While declaring the process, group arguments are introduced after the semicolon delimiter. The core logic of myJoin uses a single text variable as its state, which gets expanded through concatenation at every execution. The first execution has to be special-cased, hence the branch if c == "" as the delimiter is omitted when there is only a single value.

A process can have multiple arguments separated by commas (,) and multiple group arguments also separated by commas. The two kinds of arguments are delimited by a semicolon (;). The group arguments are available all the time, but their values remain unchanged for the duration of the cycle associated with the group.
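
As a minimal sketch of a process with several group arguments, the hypothetical myScaledSum() below takes one regular argument followed, after the semicolon, by two group arguments passed as scalar literals:

```envision
table Numbers = with
  [| as N |]
  [| 3 |]
  [| 4 |]
  [| 8 |]

def process myScaledSum(a : number; init : number, scale : number) with
  keep s = init
  s = s + scale * a // 'scale' keeps the same value for the whole group
  return s

r = myScaledSum(Numbers.N; 1000, 2) sort Numbers.N

show summary "" with r
```

With a single group, the expected result is 1000 + 2 * (3 + 4 + 8), i.e. 1030.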

Advanced remark: The native implementation of join() is more efficient than its naive implementation myJoin() as introduced above. In Envision however, text values are limited to 256 characters, hence, even the naive implementation would be limited, by design, in its capacity to deliver bad performance.

Defaults on empty groups

Processes do return values even on empty groups. Envision provides several mechanisms to control those values. The following script provides side-by-side alternatives, clarifying the behavior of the process over an empty group.

def process myProductNoDefault(x : number) with
  keep prod = 4 // not set on empty groups
  prod = prod * x
  return prod

def process myProduct(x : number) default 1 with
  keep prod = 4 // not set on empty groups
  prod = prod * x
  return prod

table T = extend.range(5)

keep where (T.N > 10)

x = myProductNoDefault(T.N) sort T.N   // 0
y = myProduct(T.N) sort T.N            // 1
z = myProduct(T.N) sort T.N default 42 // 42

show summary "" a1c1 with x, y, z

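In the above script, the keep where statement ensures that the table T is empty until the end of the script. Thus, the process calls which define x, y and z are performed over an empty table. As indicated by the comments, x falls back to 0 as no default is specified, y takes the default 1 specified in the declaration of myProduct, and z takes the default 42 specified at the call site, which overrides the one from the declaration.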

State initialization

The state of a process includes all its state variables, i.e. variables that are declared with the keyword keep. Let’s illustrate the capabilities of the state initialization with:

table Numbers = with
  [| as N |]
  [| 3 |]
  [| 4 |]
  [| 8 |]

def process mySum(a : number; init : number) with
  keep cpy = init
  keep sum = cpy + abs(-1)
  sum = sum + a
  return sum

sum = mySum(Numbers.N; 1000) sort Numbers.N

show summary "" with sum

Which displays the result 1016.

The state variable cpy is initialized with the group argument init. Then, in turn, the state variable sum is initialized using the other state variable cpy along with an expression abs(-1) that evaluates to 1. The process is called with 1000 as the sole init value, as there is only one group in this case, and returns 1016, which gets displayed.

State initialization has to follow several rules. All the keep statements must be grouped at the top of the function body, prior to any other statement. The defining expression of a state variable can use any group argument, as well as any other state variable that has already been defined. This expression can also use pure functions, either user defined or from the standard library.
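
The sketch below exercises these rules together. The pure function twice() and the state variable start are hypothetical names introduced for illustration:

```envision
table Numbers = with
  [| as N |]
  [| 3 |]
  [| 4 |]
  [| 8 |]

def pure twice(x : number) with
  return 2 * x

def process mySum(a : number; init : number) with
  keep start = twice(init) // group argument, user defined pure function
  keep s = start + abs(-1) // earlier state variable, standard library function
  s = s + a
  return s

r = mySum(Numbers.N; 100) sort Numbers.N

show summary "" with r
```

Under these assumptions, the expected result is 200 + 1 + (3 + 4 + 8), i.e. 216.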

Process instances

A process is the combination of a state and transitions (between states). The syntax introduced so far blends the state and its transitions in a fairly concise manner; however, there are situations where it is useful to disentangle the two. A process instance is a construct offered by Envision to perform this disentanglement. Let’s illustrate how a process instance is introduced within a process with:

table Numbers = with
  [| as N |]
  [| 2 |]
  [| 4 |]
  [| 8 |]

def process mySum(a : number) with
  keep process myInstance = sum(number)
  statePlusOne = myInstance + 1
  updatedState = myInstance(a)
  return (statePlusOne, updatedState)

Numbers.StatePlusOne, Numbers.UpdatedState = mySum(Numbers.N)
                                             scan Numbers.N

show table "" with Numbers.StatePlusOne, Numbers.UpdatedState

Which displays the following table:

StatePlusOne UpdatedState
1 2
3 6
7 14

The above script introduces a process instance named myInstance with the syntax keep process. On the left side of the assignment, the first keyword keep indicates that the statement belongs to the realm of the process state; then, the second keyword process indicates that a process instance is about to be defined. On the right side of the assignment a function (either pure or process) is identified by its signature.

The process instance myInstance has a dual purpose. First, it can be used to access the state of the process as done in the next line statePlusOne = myInstance + 1. Second, it can be used to update the process instance as done in the line that follows updatedState = myInstance(a).

The syntax to define a process instance follows the pattern fname(type1, type2, type3), which starts with the function name, followed with the list of types accepted by the function. This design allows you to pick the right function, even in the presence of multiple overloads. When group arguments are present the syntax becomes fname(type1, type2, type3; arg1, arg2), which introduces the usual semicolon ; separator followed by the actual group argument values.

The following script illustrates the process instance syntax in presence of group arguments:

table Variants = with
  [| as Product, as Color, as Limit  |]
  [| "shirt", "pink", ", " |]
  [| "shirt", "white", ", " |]
  [| "socks", "green", " - " |]
  [| "socks", "yellow", " - " |]

table Products[Product] = by Variants.Product
Products.Limit = same(Variants.Limit)

def process myJoin(a : text; limit : text) with
  keep process myInstance = join(text; limit)
  return (myInstance, myInstance(a))

Variants.State, Variants.Updated =
                  myJoin(Variants.Color; Products.Limit)
                  by Variants.Product
                  scan Variants.Color

show table "Colors" with
  Product
  Variants.State
  Variants.Updated

Which displays the following table:

Product State Updated
shirt (empty) pink
shirt pink pink, white
socks (empty) green
socks green green - yellow

In the above script, the process instance is defined with join(text; limit), which mixes a data type, text, with an actual text value, limit. From this point on, myInstance behaves as if it no longer had a group argument, as this argument has already been set. Hence, the call myInstance(a) in the return statement passes no group argument.

The definition of a process instance must include explicit values for the group arguments, if any. Those explicit values are subject to the same rules that govern state variables: the expression must be built from group arguments (of the very process being defined) or from previously defined state variables. When the process instance is later called to update its state, group arguments are omitted.

Finally, process instances can also be used when tuples are returned, as illustrated by the following script:

table Numbers = with
  [| as N |]
  [| 2 |]
  [| 4 |]
  [| 8 |]

def process MinMax(a : number) with
  keep min = +1000000
  keep max = -1000000
  if a < min
    min = a
  if a > max
    max = a
  return (min, max)

def process MinMax2(a : number) with
  keep process myInstance = MinMax(number)
  min, max = myInstance(a)
  return (min, max)

Numbers.Min, Numbers.Max = MinMax2(Numbers.N) scan Numbers.N

show table "" with Numbers.N, Numbers.Min, Numbers.Max

Which displays the following table:

N Min Max
2 2 2
4 2 4
8 2 8

In the script above, the process MinMax2() is a simple wrapper around MinMax() introduced for the prime purpose of illustrating how process instances can also return tuples, as done with min, max = myInstance(a).
