Enum Types

Enums are complex data types intended to deliver better compute performance and to achieve a higher degree of programming correctness. An enum, shorthand for enumeration, is an immutable collection of text values. In practice, enums are used as a replacement for the text primitive data type.

Inline enums

An enum can be defined inline while explicitly listing all the allowed text values. This can be done with a table enum statement as illustrated by:

table enum Countries = "BE", "FR", "UK", "US"

show table "Countries" a1b4 with
  text(Countries.Value)
  "(\{Countries.Label})"

be = enum.Countries("BE")
show scalar "Belgium" c1 with text(be)

In the above script, an enum named Countries is introduced in the first line. The values of the enum are displayed by the table tile. Finally, a single text value is converted to its enum counterpart, then converted back to text for display within a scalar tile.

The call to the function text(x : enum Countries) is actually unnecessary in the specific context of a show statement. In this context, the conversion is done automatically. Thus, the beginning of the above script could be simplified as:

table enum Countries = "BE", "FR", "UK", "US"

show table "Countries" a1b4 with
  Countries.Value
  "(\{Countries.Label})"

The inline declaration of an enum can also be done over multiple lines:

table enum Countries =
  "BE"
  "FR", "UK"
  "US"

The table enum Countries statement creates a series of elements:

  1. A data type named enum Countries.
  2. A table named Countries with 1 line per enum value.
  3. A vector named Countries.Label that contains the text values of the enum.
  4. A vector named Countries.Value that contains the values of type enum Countries.
  5. A special function named enum.Countries(x : text) to parse from text values.
  6. An overload of the text(x : enum Countries) function to convert to text values.

The enum labels are case-sensitive and any white space in the labels is significant. The enum values are unordered: they cannot be compared (via <) or sorted (via sort).

The table named after the enum can be used, among other things, to enumerate the values of the enum, as done in the above script through the vector Countries.Value.

When defined inline, enums are eagerly processed during the script compilation. Thus, it is possible to have the definition of the enum appear, in the code, below the first use. For example, it is possible to rewrite the script above as:

show table "Countries" a1b4 with
  Countries.Value // enum auto-converted to text
  "(\{Countries.Label})"

table enum Countries = "BE", "FR", "UK", "US"

However, it is advised to keep enum declarations above the code that logically depends on those enums. This feature is intended for specific situations where colocating the declaration of the enum with another part of the script is more readable than an early declaration.

Under the hood, the Envision runtime replaces the text labels of an enum by compact identifiers, which can be processed more efficiently than text values.

As a rule of thumb, once the data preparation is complete, we recommend using enums for most text columns of limited cardinality. The limitations of enums are detailed in the "Performance and limitations" section below. Enums offer a simple mechanism to avoid entire classes of programming mistakes, for example testing equality between a country code and a currency code.

Advanced remark: Enums represent a special case of what is typically known as a generic type in languages like C# or Java. They are the sole complex data type supported by Envision. The intent behind the Envision enums is to provide type safety to its relational algebra. Envision enums are similar in essence to the PostgreSQL and MySQL enums.

Matching and filtering

Enums benefit from matching and filtering capabilities that take advantage of their strongly typed nature. The following script illustrates the matching syntax:

table enum Countries = "BE", "FR", "UK", "US"

x = enum.Countries("UK")

y = match x with
  "BE" -> 1
  "FR" -> 1 + 1
  "UK" -> 1 + 1 + 1
  "US" -> 1 + 1 + 1 + 1

show scalar "y" with y // displays '3'

In the above script, the match keyword is used to enumerate the values of the enum Countries type. On the left side of each case, the text literals are automatically converted into enum values in order to test for equality.

If a case is missing, the compilation fails as illustrated:

table enum Countries = "BE", "FR", "UK", "US"

x = enum.Countries("UK")

y = match x with
  "BE" -> 1
  "FR" -> 1 + 1
  "UK" -> 1 + 1 + 1 // FAILS: the case "US" is missing

show scalar "y" with y

However, not all cases need to be listed explicitly; a fallback can be used instead:

table enum Countries = "BE", "FR", "UK", "US"

x = enum.Countries("UK")

y = match x with
  "BE" -> 1
  "FR" -> 1 + 1
  "UK" -> 1 + 1 + 1
  .. -> 1 + 1 + 1 + 1

show scalar "y" with y // displays '3'

In the above script, the token .. is used to indicate the case that is selected if none of the previous cases match.

The exhaustivity checks provided by Envision for the enums ensure that no case gets accidentally overlooked. This behavior is desirable to avoid certain classes of programming mistakes.
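As a further illustration, the same exhaustive matching can be sketched over a whole vector of enum values rather than a single scalar. The sketch below assumes that match is applied line by line over Countries.Value; the vector Countries.Name is introduced here for illustration only.

```envision
table enum Countries = "BE", "FR", "UK", "US"

// Hypothetical vector: one display name per enum value.
// Removing any of the four cases below is expected to
// trigger the same compilation failure as discussed above.
Countries.Name = match Countries.Value with
  "BE" -> "Belgium"
  "FR" -> "France"
  "UK" -> "United Kingdom"
  "US" -> "United States"

show table "Countries" a1b4 with
  Countries.Label
  Countries.Name
```

If a new label is later added to the enum declaration, the missing case should surface at compile time instead of silently producing a wrong name.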

Trying to call the special function enum.Countries(x : text) against an invalid value fails. However, it is straightforward to test whether a given text value belongs to an enum:

table enum Countries = "BE", "FR", "UK", "US"

table Raw = with
  [| as Label  |]
  [| "BE"  |]
  [| "fr"  |]
  [| " US" |]

Raw.IsValid = Raw.Label in Countries.Label

show table "Countries" a1b3 with
  Raw.Label
  Raw.IsValid

In the above script, the expression Raw.Label in Countries.Label evaluates as a Boolean value, which indicates whether the text value Raw.Label belongs to enum Countries.
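Building on this membership test, a possible pattern is to filter out the invalid labels first and only then parse the remaining ones. The sketch below assumes that enum.Countries(..) is evaluated only against the lines kept by the where filter; the vector Raw.Code is a name chosen for illustration.

```envision
table enum Countries = "BE", "FR", "UK", "US"

table Raw = with
  [| as Label  |]
  [| "BE"  |]
  [| "fr"  |]
  [| " US" |]

// Keep only the lines whose label belongs to the enum.
where Raw.Label in Countries.Label
  // Assumption: within the filter, every remaining label is valid,
  // so the parsing is not expected to fail.
  Raw.Code = enum.Countries(Raw.Label)
  show table "Valid countries" a1b3 with
    Raw.Label
    Raw.Code
```

Since labels are case-sensitive and white space is significant, the entries "fr" and " US" should be excluded by the filter.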

As syntactic sugar, when an enum appears in an equality or inequality expression, i.e. a comparison with == or !=, the enum value is, from a semantic perspective, automatically converted to a text value to make the comparison possible:

table enum Countries = "BE", "FR", "UK", "US"

table Details = with
  [| as Label, as Name      |]
  [| "BE", "Belgium"        |]
  [| "FR", "France"         |]
  [| "UK", "United Kingdom" |]

fr = enum.Countries("FR")

where fr == Details.Label // auto conversion to 'text(fr)'
  show table "Countries" a1b2 with
    Details.Label
    Details.Name

In particular, this syntax alleviates the need to call the text() function on the enum value. Under the hood, the Envision runtime attempts to avoid an actual conversion of the enum value to its corresponding text value in order to minimize the performance overhead.
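The inequality operator can be sketched in the same way. The script below is a minimal variation of the previous one, assuming that != benefits from the same automatic conversion as ==:

```envision
table enum Countries = "BE", "FR", "UK", "US"

table Details = with
  [| as Label, as Name      |]
  [| "BE", "Belgium"        |]
  [| "FR", "France"         |]
  [| "UK", "United Kingdom" |]

fr = enum.Countries("FR")

// Keeps the lines whose label differs from 'text(fr)'.
where Details.Label != fr
  show table "Other countries" a1b2 with
    Details.Label
    Details.Name
```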

Enum-typed reads

Input files read by an Envision script can declare a column to be of an enum type. Moreover, the enums themselves can be defined based on the data observed in input files. In order to illustrate the affinity between enum and the read statements, let’s start by producing a flat file:

table C = with
  [| as Code, as Name       |]
  [| "BE", "Belgium"        |]
  [| "FR", "France"         |]
  [| "FR", "French Guiana"  |]
  [| "UK", "United Kingdom" |]

show table "" export: "/sample/countries.csv" with C.Code, C.Name

The above script writes a list of 4 entries into a flat text file named countries.csv. This script only needs to be run once. The following scripts in this section read this file.

The column of a table can be typed as an enum. This behavior is of prime interest to ensure that the column does not contain corrupt entries, which would wreak havoc downstream in the script itself. The following script illustrates how an enum defined inline can be used as a data type in a read block:

table enum Countries = "BE", "FR", "UK", "US"

read "/sample/countries.csv" as C with
  Code : enum Countries
  Name : text

show table "Countries" a1b3 with
  C.Code
  C.Name

The above script uses the syntax Code : enum Countries to declare the vector C.Code to be typed according to the enum. If the input file countries.csv were to contain a country code that wasn’t listed in enum Countries, then the read operation would fail at runtime.

However, the validity of an enum type found in a read block is only checked if the column happens to be used by the Envision script. The following script runs successfully:

table enum Countries = "BE" // "FR", "UK" missing

read "/sample/countries.csv" as C with
  Code : enum Countries
  Name : text

show table "Countries" a1b3 with C.Name // succeeds

The above script succeeds even though the file countries.csv contains codes that are not listed in the declaration of the enum Countries, because the vector C.Code is never used. As a result, the correctness of the content of C.Code is not checked. Conversely, the following script fails at runtime, because C.Code is displayed and therefore checked:

table enum Countries = "BE" // "FR", "UK" missing

read "/sample/countries.csv" as C with
  Code : enum Countries
  Name : text

show table "Countries" a1b3 with C.Code

An enum can also be defined directly from a read statement. Instead of explicitly declaring the enum values in the script, the values are extracted from the input files:

read "/sample/countries.csv" as C with
  Code : table enum Countries
  Name : text

show table "Codes" a1b3 with Countries.Label
show table "Names" c1d3 with C.Name

In the above script, the syntax table enum Countries is used to introduce the enum named Countries. While the table C has 4 lines, the table Countries has only 3 lines, as the enum represents a collection of distinct text values.

When a read block is used to declare an enum type, there are no checks involved beyond the capacity limits (see below): the enum values are the distinct values observed in the input file. If the input file contains incorrect enum values, those values end up in the definition of the enum. However, in practice, a first read block can be used to declare an enum (this file is assumed to be correct), while a second read block consumes the enum (the integrity of this file is checked against the first one).
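The two-file pattern described above can be sketched as follows. The file /sample/orders.csv and its columns Country and Quantity are hypothetical and only serve the illustration; the first read block declares the enum, and the second one consumes it.

```envision
// Reference file, assumed correct: it defines the enum values.
read "/sample/countries.csv" as C with
  Code : table enum Countries
  Name : text

// Hypothetical transactional file: its Country column is checked
// against the values observed in countries.csv.
read "/sample/orders.csv" as Orders with
  Country : enum Countries
  Quantity : number

// Orders.Country is used here, so its content is expected to be checked.
show table "Orders" a1b3 with
  Orders.Country
  Orders.Quantity
```

With this layout, a country code present in the hypothetical orders file but absent from countries.csv should make the second read fail at runtime, as long as Orders.Country is used by the script.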

Anonymous enums

Enums offer a superior compute performance compared to text values when the data has a low cardinality. Thus, when considering large tables, it can be of interest to apply the enum type to the majority of the text columns that do not contain more than a few thousand distinct values. In this situation, naming the enum is syntactic overhead. Thus, Envision offers a mechanism to declare anonymous enums within read blocks.

Let’s revisit the flat file countries.csv created in the previous section. The following script creates two distinct anonymous enum types:

read "/sample/countries.csv" as C with
  Code : table enum
  Name : table enum

show table "Countries" a1b3 with C.Code, C.Name

In the above script, the syntax table enum omits the name of the enum and triggers the creation of an anonymous enum instead.

It is also possible to create an anonymous enum based on an arbitrary text vector with the enum(..) function:

table Countries = with
  [| "BE" as Code |]
  [| "DE" |]
  [| "FR" |]
  [| "US" |]
  [| "BE" |] // duplicate

Countries.AnoEnum = enum(Countries.Code)

show table "Countries" a1b4 with text(Countries.AnoEnum)

In the above script, the function enum(..) returns an anonymous enum. A new overload for the function text(..) is also created to convert the enum values back to text. This overload is used in the expression text(Countries.AnoEnum).

Deanonymized enums

When an “anonymous” enum is used as the primary dimension of a table, the enum is named after the table itself. Let’s revisit the previous example:

read "/sample/countries.csv" as C[name] with
  Code : table enum
  Name : table enum

show table "Countries" a1b3 with C.Code, C.Name

In the above script, the enum C data type is introduced, and the enum is named after C, its defining table. The primary dimension of this enum table is name.

This mechanism can be seen as a minor design edge case of anonymous enums.

Primary dimensions

The creation of an enum leads to the creation of a table sharing the same name as its originating enum. This table’s primary dimension has the same type as the enum itself and can be named explicitly:

table enum Countries[country] = "BE", "FR", "UK", "US"
show table "Countries" a1b4 with country

The primary dimension can also be named when the enum is declared as part of a read block. Revisiting the flat file countries.csv created in the "Enum-typed reads" section, this can be done with:

read "/sample/countries.csv" as C with
  Code : table enum Countries[country]
  Name : text

show table "Countries" a1b3 with country

It is also possible to use an enum as the primary dimension of the table being read:

read "/sample/countries.csv" as C[name] with
  Code : text
  Name : table enum

show table "Countries" a1b3 with name

In the above script, the type enum C is used for the primary dimension of the table C. The enum type is assigned through the syntax table enum. However, the enum table isn’t anonymous, as it is the table C itself.

When the primary dimension of a table is typed as an enum, if duplicate values are found for the enum, then the read block fails. This problem is illustrated with:

read "/sample/countries.csv" as C[code] with
  Code : table enum // Fails due to duplicate value 'FR'
  Name : text

show table "Countries" a1b3 with code

Whenever a table contains a column that is expected to be a “well-behaved” primary dimension, it is recommended to strongly type this dimension as an enum in order to benefit from the integrity checks performed by Envision.

SKUs (stock keeping units) represent a more interesting supply chain example. Let’s produce a minimal flat file illustrating a list of SKUs, each SKU having a location and product reference:

table SKUs = with
  [| "Paris" as Loc, "shirt-123" as Ref |]
  [| "London",       "shirt-123"        |]
  [| "Paris",        "pant-234"         |]
  [| "London",       "hat-345"          |]

show table "" export: "/sample/skus.csv" with SKUs.Loc, SKUs.Ref

The primary dimensions of the enums declared in a read block can then be used to assign secondary dimensions to the table being read:

read "/sample/skus.csv" as SKUs[sku] expect [ref, loc] with
  Ref: table enum Ref[ref]
  Loc: table enum Loc[loc]

show table "SKUs" a1b3 with ref into SKUs, loc

In the above script, the secondary dimensions are declared via expect [ref, loc] above the code that declares the enums themselves.
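Once the secondary dimensions are in place, the enum tables can presumably serve as aggregation targets for the table being read. The sketch below assumes that the usual aggregation rules apply from SKUs to Loc through the secondary dimension loc; the vectors SKUs.One and Loc.SkuCount are names chosen for illustration.

```envision
read "/sample/skus.csv" as SKUs[sku] expect [ref, loc] with
  Ref: table enum Ref[ref]
  Loc: table enum Loc[loc]

// Hypothetical helper vector: one unit per SKU line.
SKUs.One = 1

// Assumption: SKUs lines are grouped by their secondary dimension 'loc'.
Loc.SkuCount = sum(SKUs.One)

show table "SKUs per location" a1b2 with
  Loc.Label
  Loc.SkuCount
```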

Performance and limitations

An enum is limited to 100 million distinct values and 1 GB of text data, whichever comes first.

From the runtime perspective of Envision, a small vector of enum values can hold up to 100 million values, far more than a small vector of text values, which is limited to 2.75 million values. Thus, for language constructs like each blocks and autodiff blocks, which expect to operate iteration-wise with chunks of data that fit in a page, it is recommended to use enums whenever possible.

As a rule of thumb, when considering a text column in a read block, if the script logic does not sort against this column and if its cardinality (i.e. its number of distinct values) is below 10,000, then we suggest typing this column as an anonymous enum instead.

Beyond 10,000 distinct values, there might be a gain of performance when converting a text column into an enum, but a loss of performance may also happen due to the cost involved in creating the underlying dictionary.