Path schemas
A path schema binds one or several files to a list of typed fields. Schemas are used for reading tables from files or writing tables to files, without having to repeat the list of field names every time, and without letting typos creep into the file paths themselves.
Schema overview
Schemas are intended as a means to stabilize and document the format of the tabular files either written or read by Envision scripts. Schemas help to avoid entire classes of programming mistakes rooted in the file-based interactions between scripts. For example, changing the name of a field in an output file can inadvertently break another script, executed downstream, as an expected field is no longer found. Moreover, if the same file is read in multiple scripts, then the use of a schema can significantly reduce the verbosity of those scripts.
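For contrast, a hedged sketch of what reading the same file looks like without a schema - the typed field list has to be repeated in every script that touches the file (the field names below match the examples of this section):
read '/sample/products.csv' as Products with
  Product : text
  Color : text
  Price : number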
Let’s introduce our first schema, and use it to produce a more concise write statement:
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
table Products = with
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "pants", "blue", 15.00 |]
  [| "hat", "red", 5.25 |]
write Products as '/sample/products.csv'
In the above script, the schema '/sample/products.csv' is defined with three named and typed fields. The schema definition is introduced by the keyword schema followed by a path literal (not to be confused with a text literal). The write statement refers to the schema identified through its path, and thus omits the fields entirely.
The schema can also be used on the read-side:
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
read '/sample/products.csv' as Products
show table "My Products" a1b3 with
  Products.Product
  Products.Color
  Products.Price
In the above script, the read statement also refers to the schema identified through its path. This, again, allows the fields to be omitted entirely.
Field documentation
The fields introduced by a schema can be documented. This documentation becomes available by hovering over the vectors in the scripts that use the schema.
schema '/sample/products.csv' with
  /// The product identifier.
  Product : text
  /// The 3-letter color code.
  Color : text
  /// The VAT-included unit price.
  Price : number
In the above script, the comments introduced by /// are placed immediately above the field declarations they document.
Field renaming
Field names, as found in files, may not be compatible with Envision’s variable naming rules. For example, a name may include whitespace. Envision provides a mechanism to rename a field.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
  VAT : number = read("value added tax") // renaming the field
table Products = with
  [| as Product, as Color, as Price, as VAT |]
  [| "shirt", "white,grey", 10.50, 0.2 |]
  [| "pants", "blue", 15.00, 0.2 |]
  [| "hat", "red", 5.25, 0.2 |]
write Products as '/sample/products.csv'
In the above script, the VAT field is introduced with an assignment read("value added tax") that renames the field originally named value added tax as VAT. Thus, the CSV file produced through the write statement contains a fourth column named value added tax.
A file written with a schema can be read with the same schema.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
  VAT : number = read("value added tax") // renaming the field
read '/sample/products.csv' as Products
show table "My Products" a1d3 with
  Products.Product
  Products.Color
  Products.Price
  Products.VAT
In the above script, the read statement finds a column named value added tax in the CSV file. This column gets attached to the VAT field.
The field renaming mechanism is the canonical example of a field rebinding that happens symmetrically on the write side and on the read side.
Field rebinding is typically intended to cope with field names that are not compatible with the variable naming rules of Envision, but it can also facilitate refactoring operations, removing the need to immediately rewrite the files when the corresponding variable names are changed.
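For instance, assuming the Envision-side variable Color gets renamed Label during a refactoring (the name Label is purely illustrative), a hedged sketch keeping the file column unchanged:
schema '/sample/products.csv' with
  Product : text
  Label : text = read("Color") // the file column stays 'Color'; only the variable changes
  Price : number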
Field rebinding on write
When writing against a path schema, all the fields specified by the schema are expected to be found among the vectors of the table. The mechanism whereby each schema field gets attached to its counterpart vector is referred to as binding. However, sometimes the vectors present in the table may not exactly match the expectations of the schema. Thus, Envision offers a mechanism to rebind, within the write block, the fields to newly introduced vectors.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Size : text
  Price : number
table Products = with // 'Size' is missing
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "pants", "blue", 15.00 |]
  [| "hat", "red", 5.25 |]
write Products as '/sample/products.csv' with
  Size = "XL" // completing 'Size'
  Color = uppercase(Products.Color) // redefining 'Color'
In the above script, the write block interferes with the binding of two fields. First, the field Size, which does not have a counterpart in the table Products, is assigned the constant text value "XL". Second, Color, which does have a counterpart, gets a new one via uppercase(Products.Color).
Field rebinding is syntactic sugar. This mechanism can alleviate or postpone the need for extensive refactoring of existing scripts when the schemas themselves evolve.
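To illustrate the “sugar” aspect, the write block above behaves like the following hedged sketch, where the vectors are amended before a plain write:
Products.Size = "XL" // broadcast a constant into a new vector
Products.Color = uppercase(Products.Color) // replace the existing vector
write Products as '/sample/products.csv' // plain write, no rebinding needed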
Advanced remark: There is no isolated rebinding on read because the path schema precisely hints that the files at the specified path follow the structure specified by the schema. In contrast, the named schema, detailed later, offers an isolated rebinding on read, as a named schema isn’t coupled with specific files.
Path literals
The path is a special datatype that identifies a file or a list of files inside the Lokad filesystem, and also identifies a schema - as illustrated in the previous section. Path literals should not be confused with text literals.
const myPath = '/foo.csv' // single-quote delimiters
const myText = "/foo.csv" // double-quote delimiters
For display purposes, path values are automatically converted to text values:
const myPath = '/foo.csv'
show scalar "" a1 with myPath
Path prefixing
Paths support a limited set of operations. Path prefixing is one of them. It offers the possibility to isolate a prefix folder from the rest of the path.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
const myFolder = '/sample'
read '\{myFolder}/products.csv' as Products
show table "My Products" a1b3 with
  Products.Product
  Products.Color
  Products.Price
write Products as '\{myFolder}/products.csv'
In the above script, the variable myFolder defines a prefix folder. This variable is used in the read and write blocks that follow, through the \{..} inclusion operator.
Beware, the path prefixing operator \{..} should not be confused with text interpolation. The operator must be positioned at the beginning of the path.
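To make the positioning rule concrete, a hedged sketch (the commented-out line illustrates what gets rejected):
const myFolder = '/sample'
read '\{myFolder}/products.csv' as Products // accepted: the prefix sits at the beginning of the path
// read '/data\{myFolder}/products.csv' .. // rejected: the prefix does not sit at the beginning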
Path schema cloning
When working on a draft version of a script, or on the next version of a data pipeline, it can be useful to relocate the files being written and read in order to avoid interfering with production. Path schema cloning facilitates such a relocation by introducing a second schema, identical to the first except that it points to a distinct location.
schema '/production/products.csv' with
  Product : text
  Color : text
  Price : number
schema '/sample/products.csv' = '/production/products.csv' // cloning the schema
read '/sample/products.csv' as Products
show table "My Products" a1b3 with
  Products.Product
  Products.Color
  Products.Price
In the above script, the first schema points to the /production folder, which we want to avoid touching. Thus, a second schema, a clone of the first, is introduced right below it. The read block then refers to this second schema.
In practice, path schema cloning becomes relevant when the schema itself is isolated into a module and shared between the two environments. The cloning operation allows the original schema to be preserved while changing the path.
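For instance, a hedged sketch, assuming the '/production/products.csv' schema is exported by the module '/sample/my-module' introduced in the last section of this document:
import "/sample/my-module" as M // the module exports the '/production/products.csv' schema
schema '/sample/products.csv' = '/production/products.csv' // local clone, production files preserved
read '/sample/products.csv' as Products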
Cloning can also be performed over all the schemas sharing the same prefix.
schema '/production/products.csv' with
  Product : text
  Color : text
  Price : number
schema '/sample' = '/production' // clone all the schemas prefixed by '/production'
read '/sample/products.csv' as Products
show table "My Products" a1b3 with
  Products.Product
  Products.Color
  Products.Price
In the above script, the cloning statement refers to all the schemas that match the specified path prefix.
This mechanism facilitates the setup of a second flow, typically a “development” flow, beside the “production” one. It minimizes the amount of changes in the Envision code needed to prevent the secondary flow from interfering with the primary one.
Parameterized paths
The data of a table can be partitioned into multiple files. Partitioning is primarily intended for scalability purposes: it can reduce bandwidth, as entire files can be filtered out through the partition. A parameterized path refers to a list of files - not just a single file - produced through a partitioned write. The parameterized path is then used to re-consolidate the data originating from multiple files into a single table.
Let’s start with a script that partitions a table into multiple files.
schema '/sample/products-\{Bucket}.csv' with
  Bucket : number
  Product : text
  Color : text
  Price : number
table Products = with
  [| as Product, as Color, as Price, as Bucket |]
  [| "shirt", "white,grey", 10.50, 1 |]
  [| "pants", "blue", 15.00, 2 |]
  [| "hat", "red", 5.25, 3 |]
  [| "hat", "blue", 5.25, 3 |]
write Products partitioned as '/sample/products-\{..}.csv'
In the above script, a schema with a variable named Bucket is introduced. This variable has the number datatype. The write block refers to the schema, replacing the Bucket variable with a selector, here .. which captures all numbers. The keyword partitioned indicates that multiple files may be produced by the write block. Indeed, this write block produces 3 files respectively named /sample/products-1.csv, /sample/products-2.csv and /sample/products-3.csv.
The variable Bucket is a path parameter of the path schema.
The files of the partition can, in turn, be read using the same schema.
schema '/sample/products-\{Bucket}.csv' with
  Bucket : number
  Product : text
  Color : text
  Price : number
read '/sample/products-\{..}.csv' as Products
show table "My Products" a1d4 with
  Products.Bucket
  Products.Product
  Products.Color
  Products.Price
In the above script, the read block refers to the schema introduced on the first line. The Bucket value is found both in the file name and within the file itself.
The acceptable datatypes for path parameters are: text, number, date, week and month. A single path can include multiple path parameters as long as they are delimited by a separator.
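For example, a hedged sketch of a path with two parameters, assuming a month-valued parameter named Month (purely illustrative) alongside Bucket, with '-' acting as the separator:
schema '/sample/products-\{Month}-\{Bucket}.csv' with
  Month : month
  Bucket : number
  Product : text
  Price : number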
Parameterized paths are intended to write partitions through write .. partitioned statements, and to read them back afterward. This mechanism is intended to support internal processing steps within an Envision data flow. It is not intended to consolidate raw multi-file extractions as obtained from a third-party business system.
Two path schemas cannot capture the same file; this situation is treated by the Envision compiler as an error. This guarantee holds even in the presence of parameterized paths: if there exists a combination of path parameter values that creates a collision between two path schemas, then the script does not compile.
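A hedged sketch of such a collision, rejected at compile time (both schemas are illustrative):
schema '/sample/products-\{Bucket}.csv' with
  Bucket : number
  Product : text
schema '/sample/products-1.csv' with // does not compile: this path is already captured above for Bucket = 1
  Product : text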
Path capture and file deletion
All the files captured by the parameterized path get overwritten when writing over the path schema. This point is a subtle but important aspect of the partitioned write.
Let’s revisit the first script of the previous section, and let’s assume that the 3 files respectively named /sample/products-1.csv, /sample/products-2.csv and /sample/products-3.csv exist in the file system.
schema '/sample/products-\{Bucket}.csv' with
  Bucket : number
  Product : text
  Color : text
  Price : number
table Products = with
  [| as Product, as Color, as Price, as Bucket |]
  [| "shirt", "white,grey", 10.50, 1 |]
  [| "pants", "blue", 15.00, 2 |]
write Products partitioned as '/sample/products-\{..}.csv'
The above script overwrites /sample/products-1.csv and /sample/products-2.csv, but it deletes the file /sample/products-3.csv.
Indeed, the write operation with a parameterized path works as follows:
- List all the files that are captured by the pattern.
- Create new files, if needed.
- Overwrite existing files, if needed.
- Delete the captured files that did not receive any lines.
This behavior ensures the symmetry between the write and the read operations. If files captured by the parameterized path were left untouched (for example /sample/products-3.csv), then the read operation would yield a table that is not line-wise identical to the table originally fed to the write operation.
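To make this symmetry concrete, a hedged sketch of reading the partition back right after the write above:
read '/sample/products-\{..}.csv' as Products
// yields exactly the 2 lines written above (buckets 1 and 2); had the file
// '/sample/products-3.csv' been left in place, a third, stale line would reappear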
Bounded paths
The list of files captured by a parameterized path can be constrained through path bounds. Let’s revisit the script example of the previous section, assuming again that the 3 files respectively named /sample/products-1.csv, /sample/products-2.csv and /sample/products-3.csv exist in the file system.
schema '/sample/products-\{Bucket}.csv' with
  Bucket : number
  Product : text
  Color : text
  Price : number
table Products = with
  [| as Product, as Color, as Price, as Bucket |]
  [| "shirt", "white,grey", 10.50, 1 |]
  [| "pants", "blue", 15.00, 2 |]
const lowerIncl = 1 // inclusive lower bound
const higherIncl = 2 // inclusive higher bound
write Products partitioned as '/sample/products-\{lowerIncl..higherIncl}.csv'
The above script overwrites /sample/products-1.csv and /sample/products-2.csv, but it leaves the file /sample/products-3.csv untouched. The constant variables lowerIncl and higherIncl act as inclusive lower and higher bounds respectively. As those bounds are present, the file /sample/products-3.csv is not captured by the pattern, and hence not overwritten.
The bounds must be compile-time constants. However, the bounds are optional; thus, '/sample/products-\{lowerIncl..}.csv' and '/sample/products-\{..higherIncl}.csv' would also be valid bounded paths. Conversely, the path '/sample/products-\{..}.csv' is unbounded (and valid as well).
The bounded path syntax applies to the read side as well.
schema '/sample/products-\{Bucket}.csv' with
  Bucket : number
  Product : text
  Color : text
  Price : number
const lowerIncl = 1
const higherIncl = 2
read '/sample/products-\{lowerIncl..higherIncl}.csv' as Products
show table "My Products" a1d4 with
  Products.Bucket
  Products.Product
  Products.Color
  Products.Price
In the above script, only the files /sample/products-1.csv and /sample/products-2.csv get read. The content of the file /sample/products-3.csv is never touched, as this file gets filtered out by the bounded path itself.
In practice, bounded paths are the mechanism that delivers the I/O gains. The partitioned writes merely ensure the correctness of the partition under repeated write operations performed by a flow that produces files “in place” on a schedule.
Table size constraint on schema
Table size limits are used in two ways in Envision. First, those limits unlock certain features that can’t operate over arbitrarily large vectors. Second, they help avoid accidental performance issues when the tables being processed are vastly larger than what was expected when the script was written.
The declaration of a schema can include a cap on the number of lines of the resulting table.
schema '/sample/products.csv' max 10 with
  Product : text
  Color : text
  Price : number
In the above script, the max keyword is followed by the inclusive maximal number of lines associated with the path schema. At runtime, if a script attempts to write or read more than 10 lines while the write or read block is bound to this path schema, then the execution fails.
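For instance, a hedged sketch of a read bound to the capped schema above:
schema '/sample/products.csv' max 10 with
  Product : text
  Color : text
  Price : number
read '/sample/products.csv' as Products // fails at runtime if the file holds more than 10 lines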
Enum downcast of a text field
Enums may provide increased performance, especially when the cardinality of the field is low (i.e. fewer than 10,000 distinct values). Envision provides a mechanism to downcast a field originally declared as text in its schema into an enum.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
read '/sample/products.csv' as Products with
  Color : table enum Colors
show table "My Products" a1b3 with
  Products.Product
  Products.Color
  Products.Price
In the above script, Color is declared as a text field in the schema block. The downcast toward an enum happens in the read block below.
This feature can be combined with the declaration of a primary dimension for the enum itself, and the declaration of a secondary dimension for the table associated with the path schema.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
read '/sample/products.csv' as Products expect [color] with
  Color : table enum Colors[color]
show table "My Products" a1b3 with
  Products.Product
  color
  Products.Price
In the above script, inside the read block, the enum Colors gets color as its primary dimension. This dimension is set as a secondary dimension of the table Products through expect [color] on the first line of the read block.
Field aliasing on read
A schema may introduce a field that ends up creating a naming conflict in a script. To mitigate those situations, Envision offers the possibility to alias a field within a read block.
schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
read '/sample/products.csv' as Products with
  ColorAlias = Products.Color
show table "My Products" a1b3 with
  Products.Product
  Products.ColorAlias
  Products.Price
In the above script, ColorAlias is introduced within the read block as an alias (same datatype, different name) of the original Color field.
In order to understand how aliasing can be used to mitigate an actual name conflict, let’s introduce a slightly more complex example.
schema '/T.csv' with
  Id : text
read '/U.csv' as U[Id] with
  Id : text // 'Id' collides with the schema
read '/T.csv' as T[Ref] with
  Ref = T.Id // aliasing on read: 'Id' becomes 'Ref'
  X : number = 2
show table "T" a1b2 with
  Ref
  T.X
In the above script, a path schema is introduced with a field named Id. Below, a first read block introduces another table - independent from the path schema - that “accidentally” happens to have Id as its primary dimension. In the second read block, the Id field - which would have conflicted with the dimension named Id - is aliased as Ref. This alias serves as the primary dimension of the table T[Ref].
Modules and path schemas
In practice, the path schema is intended to be isolated into a module, in order to avoid duplicating the schema block between the two distinct scripts that respectively handle the write operation and the read operation.
The path schema must be marked as export in the module:
// Module named '/sample/my-module'
export schema '/sample/products.csv' with
  Product : text
  Color : text
  Price : number
Once the module is imported, the path of the read block gets matched with the path schema found in the module.
import "/sample/my-module" as M
read '/sample/products.csv' as Products
show table "My Products" a1b3 with
Products.Product
Products.Color
Products.Price
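Symmetrically, a hedged sketch of the write-side script, importing the same module:
import "/sample/my-module" as M
table Products = with
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "pants", "blue", 15.00 |]
write Products as '/sample/products.csv'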