...

Read dimensions

Storing tables in files is convenient but unlike storing data in a relational database, structured relations between tables are not a given and integrity problems may arise. For example, nothing prevents a flat text file from containing duplicated lines. Thus, Envision offers a mechanism to add relational structure to a set of tables being read as inputs. Those mechanisms are not only desirable from a correctness perspective, as it prevents certain classes of problems from happening in the first place (e.g. having a sales order referencing a non-existent product), but they are also desirable thanks to the built-in behaviors that Envision offers when those relations are declared.

Let’s revisit the example from the previous section considering two tables Products and Variants. Each variant is supposed to be attached to a product, and can only vary in color compared to the original product. The following script illustrates how to generates two CSV files, one for the products and the other for the variants:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "pants", 15.00 |]
  [| "hat", 5.25 |]

table Variants = with
  [| as Product, as Color |]
  [| "shirt", "white" |]
  [| "shirt", "pink" |]
  [| "pants", "blue" |]
  [| "pants", "black" |]
  [| "hat", "red" |]

write Products as "/sample/products.csv" with
  Product = Products.Product
  Price = Products.Price

write Variants as "/sample/variants.csv" with
  Product = Variants.Product
  Color = Variants.Color

The following script reads both files, and loads two tables Products and Variants and then displays the number of available colors, among the variants, for every product:

read "/sample/products.csv" as Products[product] with
  "Product" as product : text
  Price : number

read "/sample/variants.csv" as Variants expect [product] with
  "Product" as product : text
  Color : text

Products.Colors = distinct(Variants.Color)

show table "Color Counts" a1b3 with
  product
  Products.Colors

The script above starts by declaring product as the primary dimension of the table Products. The primary dimension is introduced between brackets right after the table name. As the name of the dimension matches (case-insensitive) the name of one of the file’s input columns, namely Product: text below, this column is used as the primary dimension. If duplicates were to be found in the Product column, the read operation would fail at runtime indicating that the unicity constraint of the primary dimension has been violated.

The table Variants is declared with product as a foreign dimension. This declaration follows the expect keyword (itself following the table name), and the name of the referenced dimension is put between brackets. If values were to be found in the Variants.Product column without having a counterpart in the product dimension, the read operation would fail indicating that the foreign dimension constraint has been violated.

Thus, if both read operations succeed, Envision guarantees that the table Products has no duplicates for Products.Product, and that every single line in Variants has exactly one counter line in the Products table. This latter property can be rephrased in a more Envision-esque perspective: due to the foreign dimension, the table Products can be broadcast into the table Variants.

This broadcast retaliationship makes it possible to define Products.Colors as done above by leveraging the natural join that exists between the two tables. Indeed, for every single of the Products table, we can identify without ambiguity the matching lines in the Variants table through the pairs Products.Product and Variants.Product.

If we had omitted the dimension declarations, it would have remained possible to produce a similar effect with:

read "/sample/products.csv" as Products with
  Product : text
  Price : number

read "/sample/variants.csv" as Variants with
  Product : text
  Color : text

Products.Colors = distinct(Variants.Color) by Variants.Product at Products.Product

show table "Color Counts" a1b3 with
  Products.Product
  Products.Colors

In the script above, the explicit by at aggregation option tells Envision how to pair the lines in the two tables Products and Variants.

As a rule of thumb, it’s better to declare dimension relationships to the greatest extent through the read statements rather than to rely on the by at options. Not only does Envision have built-in integrity checks, but the code tends to be more concise and more readable, especially when the pairing ends-up being repeated numerous times in the script.

Multiple dimensions can also be referenced. Let’s revisit the script introduced above with a third table named Colors exported to a third file as well with:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "pants", 15.00 |]
  [| "hat", 5.25 |]

table Colors = with
  [| as Color |]
  [| "black" |]
  [| "blue" |]
  [| "pink" |]
  [| "white" |]

table Variants = with
  [| as Product, as Color |]
  [| "shirt", "white" |]
  [| "shirt", "pink" |]
  [| "pants", "blue" |]
  [| "pants", "black" |]
  [| "hat", "red" |]

write Products as "/sample/products.csv" with
  Product = Products.Product
  Price = Products.Price

write Colors as "/sample/colors.csv" with
  Color = Colors.Color

write Variants as "/sample/variants.csv" with
  Product = Variants.Product
  Color = Variants.Color

These three files can be loaded with with the following script:

read "/sample/products.csv" as Products[product] with
  "Product" as product : text
  Price : number

read "/sample/colors.csv" as Colors[color] with
  "Color" as color : text

read "/sample/variants.csv" as Variants expect [product, color] with
  "Product" as product : text
  "Color" as color : text

Products.Colors = distinct(Variants.color)
Colors.Products = distinct(Variants.product)

show table "Color per Product" a1b3 with
  product
  Products.Colors

show table "Product per Color" c1d3 with
  color
  Colors.Products

The tables Products and Colors both declare a primary dimension, product and color. The table Variants declare those two dimensions as foreign dimensions, using the comma (,) as the delimiter within the brackets [ and ]. The Variants table benefits from two natural joins, illustrated by the respective definitions of Products.Color - counting the number of distinct colors for every product - and of Colors.Products - counting the number of distinct products for every color.

Any arbitrary number of foreign dimensions can be declared after the expect keyword following the syntax [dim1, dim2, dim3, ..]. Every referenced dimension generates a broadcast relationship between the two tables, and also ensures - at runtime - that all the values are found in their corresponding matching dimension.

The declarations of the primary dimensions must precede any use of the same dimension as a secondary dimension.

As a final note on dimensions, it is also possible to declare a primary dimension that does not match any of the columns declared by the read statement.

read "/sample/products.csv" as Products[id] with
  Product : text
  Price : number

The primary dimension id does not match any column, and ends up being automatically generated with one unique identifier per line in the original file. The datatype of id is ordinal, which is an intentionally opaque type that does not offer much by way of syntax in Envision.

User Contributed Notes
0 notes + add a note