So far, Envision has been used to read and write CSV files, however the Lokad platform supports reasonably diverse file formats beyond CSV. The choice of the formats has been driven by situations that Lokad has encountered while dealing with real supply chains. Most of the formatting options, both for read and write, resolve around flat text files that are ubiquitous in supply chains for good reasons (simplicity, portability, scalability, etc). However, as the notion of flat text files is relatively vague, Envision provides the capabilities to cope with all the variants that are encountered in practice.
Spreadsheets represent the second most important format, more specifically the Microsoft Excel 2007 file format, i.e. the
.xlsx format. This format is tricky - to say the least - because there is a fine line between too little capabilities and too many capabilities. Indeed, spreadsheets can spread programs (macros) leading to various security problems. Also, some features are simply not suitable for a production data pipeline, such as embedding a spreadsheet file within another spreadsheet file.
Finally, the last notable format is the platform-specific binary format of Lokad known as Ionic. This format has a strong affinity to Envision itself: it is strongly-typed and its datatypes mirror those of Envision. This format has also a strong affinity to Lokad’s data processing infrastructure, which yields superior performance compared to the other formats.
As a rule of thumb, flat text files are the best for system-to-system interop, spreadsheets are the best when end-users create and modify the files, and Ionic is the best for all the inter-script communications within the Lokad platform itself.
Advanced remarks: Our list of supported file formats is a relatively mature feature area in the Lokad platform. Format-wise, it has been a few years since we added anything major. For example, while XML and JSON are well-known formats, they tend to be absent in most supply chain setups, and those formats tend to be (nearly) unsupported by the surrounding toolchain - which goes beyond Lokad. This does not imply that Lokad won’t ever add any extra formats, merely that we follow the dominant trends in this specific software ecosystem.
Table of contents
Table level read options
The read options at the table level typically control the overall parsing behavior of the input files. As a rule of thumb, whenever one ends up resorting to those options, the input files can be reasonably considered as non-mainstream, one way or another. Envision has default behaviors that match sound and widespread practices concerning flat files. Thus, prior to using those options, it’s usually worth investigating whether the format can be adjusted upstream.
The following script revisits the parsing of the CSV file previously generated, but spells out the default parsing settings.
read "/sample/products.csv" \ fileformat: "csv" quotes: false skip: 0 separator: "," as Products with Product : text Price : number show table "Products" with Products.Product Products.Price
read statement, the table-level options follow the file pattern adopting a
key: value syntax. The first line of the above script ends with
\ which is a line continuation, only used for aesthetic purposes to break a line that would otherwise be quite long.
All the values have to be specified as literals. Unlike the file pattern itself, compile-time constants cannot be used. The following options are available:
| Option | Type | Default | Type of file | Comment |
fileformat | text | N/A | any | Applies the specified extension to all input files. Any recognizable extension, e.g.
xslx can be specified. This option overrides the default auto-detection of the file format. |
skip | number | 0 | text or Excel | Indicates how many lines should be skipped at the beginning of each file while files are being read. This option is intended for situations where the first lines are not yet the tabular data, but typically control or are the heading lines. |
encoding | text | N/A | text | Specifies the text encoding to be used while parsing files. The accepted text values are
latin1. The option overrides the default auto-detection of the file encoding. |
headers | text | N/A | text | Specifies the heading line to be used instead of the first line found in every file. These headers are intended to list the column found in the file. If headers are specified to override existing headers - as opposed to cope with missing ones - then
skip: 1 should be used. |
quotes | boolean | false | text | If
false then Envision escapes extreme quotes, as is usually done with CSV (comma-separated values) files. If
true then the extreme quotes found in a cell token are considered to be part of the token and not escaped. |
separator | text | N/A | text | Indicates the separator between cell values to be used when parsing files. This option overrides the default auto-detection that determines the most likely cell separator. |
date | text | N/A | text | Format specifier to parse dates. This option overrides the default auto-detection that determines the most likely date format. |
number | text | N/A | text | Format specifier to parse number. This option overrides the default auto-detection that determines the most likely number format. |
The default behavior of several options - e.g.
fileformat - cannot be obtained explicitly, only implicitly, by omitting the option altogether. These situations are associated with the auto-detection heuristics used by Envision that are - to a limited extent - loosely specified. On the contrary, once an option is specified, it follows a rigid behavior that is guaranteed to be enforced. Nevertheless, we recommend not specifying these options unless necessary. In our experience, these auto-detection heuristics are reliable and compatible with a production-grade data pipeline.
The file format analysis is based on the file name extension, and goes in several stages. If the file name ends with
.ion, then the file is recognized as an Excel spreadsheet or an Ionic file respectively. If the file name ends with
.zip then, the file is recognized as flat text file. Unless a
fileformat is specified, any other file name ending is considered as unknown, and leads to a parsing failure.
If the file name ends with
.zip, then, the file is decompressed respectively through the GZIP or ZIP formats. While ZIP is an archive format, Envision expects that the archive only contains a single file, and will fail otherwise. These files are decompressed, and the parsing resume.
When parsing a flat text file, unless specified otherwise, Envision automatically detects its format (e.g. separator, numbers and dates). In particular, whether the file is named
.tsv has no impact on those format detection heuristics, not even for the choice of the value separator.
As a tangential remark, header-free flat files should be avoided wherever possible. While the
headers option is available to cope with such files, relying on the option invariably makes the data pipeline fragile. Without headers, it becomes difficult to investigate any versioning problem related to file formats. For example, the inversion of two columns (assuming that the columns have the same data type) can silently permeate the data pipeline and corrupt most later calculations.
The fine print of the
number attributes is detailed in the next section.
Advanced remark: The GZIP and ZIP format should not be confused. ZIP is an archive format, and was historically associated with the WinZip utility on Microsoft Windows. GZIP is a pure compression format, and was historically part of the default GNU/Linux utilities. As a rule of thumb, the GZIP format is typically more appropriate for most data pipeline use cases. ZIP may have a role when end-users have to routinely manually upload large flat text files to Lokad while not having access to high-speed internet connections. In those cases, the ZIP format can alleviate some of the upload overhead. ZIP is somewhat of a second-class citizen from the Lokad perspective because we don’t embrace - on purpose - the archive nature of the format.
Roadmap: the table-level read options are required to be literals rather than compile-time constants, however, we intend to lift this constraint in the future.
Date and number options
Envision supports two mini-syntaxes respectively dedicated to the formatting (and parsing) of dates and numbers. Two syntaxes are invoked via the attributes
number that we introduced in the previous section.
Let’s illustrate this capability with two following scripts. The first script generates a flat file named
products.csv with two formatting conventions for dates and numbers respectively:
table Products = with [| as Product, as Price, as LaunchDate |] [| "shirt", 10.50, date(2020, 06, 25) |] [| "pants", 15.00, date(2020, 07, 12) |] [| "hat", 5.25, date(2020, 08, 04) |] show table "Products" export:"/sample/products.csv" \ date:"yy/MM/dd" number:"0,0" with Products.Product Products.Price Products.LaunchDate
Then, the second script reads the file named
products.csv with the same two formatting conventions, again for dates and numbers respectively:
read "/sample/products.csv" \ date:"yy/MM/dd" number:"0,0" as Products with Product : text Price : number LaunchDate : date show table "Products" with Products.Product Products.Price Products.LaunchDate
In the scripts above, the
date format indicates that the date will be composed of the year (two digits), followed by the month (two digits), followed by the day (two digits), using the slash
/ as the separator. Similarly, the
number format indicates that no thousand separator is used, but that the comma
, is used as the decimal separator.
More generally, the syntax for
d: day of the month from 1 through 31.
dd: same as
dbut 0 prefixed.
ddd: abbreviated day of the week (ex: Mon, Tue…).
dddd: full day of the week (ex: Monday, Tuesday…).
M: month number from 1 through 12.
MM: same as
Mbut 0 prefixed.
MMM: abbreviated name of the month (ex: Jan, Feb, …).
MMMM: full name of the month (ex: June).
yy: year number from 00 to 99, ignore the hundreds and thousands.
yyyy: year with four digits (ex: 2020).
The syntax for
number is as follows:
- The character after the initial 1 is the thousand separator. It can be omitted.
- The character before the last zeroes is the decimal separator. It can be omitted
Envision provides a symmetric behavior on the read and write sides. If a specified format is used to write data, assuming no data gets lost, then the same data can be read with the same options.
These formatting and parsing behaviors should not be confused with presentation behaviors offered by StyleCode, which will be covered in greater detail in a later section. These options only impact how data is read from files and how it is written to files. The appearance of the dashboard is not impacted by those options.
Renaming read columns
The name of a column, as found in the flat file, may not be a valid variable name in Envision. Thus, Envision provides a mechanism to rename columns as found in input files. So far, we have only seen the basic column read syntax, as illustrated by:
read "/sample/products.csv" as Products with Product : text Price : number
In the above script, there are two column reads,
Product : text and
Price : number. The column name comes first, followed by a semicolon and the data type.
The flat files and spreadsheets are restricted to the following data types, both for read and write:
boolean. Ionic files support all data types except
markdown. Specifying the data type is mandatory in Envision.
Let’s refresh the file
products.csv with the following script:
table Products = with [| as Product, as Price |] [| "shirt", 10.50 |] [| "pants", 15.00 |] [| "hat", 5.25 |] show table "Products" export: "/sample/products.csv" with Products.Product as "My Product" Products.Price
The above script is almost identical to the one we had introduced in a previous section, except for the line
Products.Product as "My Product" which specifies
"My Product" as an alternative column name in the file to be exported. This name is precisely chosen because it is not compatible with an Envision variable name due to its inner whitespace. There are many more write options, we will revisit those in a later section.
The syntax we have introduced so far would not be compatible with such an invalid column name, however, this situation can be addressed with:
read "/sample/products.csv" as Products with "My Product" as Product : text Price : number
The above script leverages the
as keyword in order to introduce
"My Product" as the text literal that will be used to match the column name as found in the file. If this script is executed right after the previous one, the sequence works.
Unlike valid variable names, the text literal specified through
as can be any text value, including whitespaces or accentuated characters for example. The syntax is symmetric from
Products.Product as "My Product" to specify an arbitrary column name on the write side, to
"My Product" as Product : text to specify it on the read side.
Reading heterogeneous files
When reading a single table from many files, all its underlying files may not have the same columns. For example, the
Orders table might be the consolidation of the data obtained from a now-defunct legacy ERP and from the current one. The respective file formats may be close enough to warrant a unified read, and yet, the subtle differences between the formats need to be accounted for.
Let’s consider a situation where the list of products is spread over two files
goods.csv that exhibit divergent column names as illustrated by:
table Products = with [| as Product, as Price |] [| "shirt", 10.50 |] [| "pants", 15.00 |] [| "hat", 5.25 |] table Goods = with [| as Good, as Price |] [| "t-shirt", 4.25 |] [| "bandana", 3.75 |] show table "Products" export: "/sample/products.csv" with Products.Product Products.Price show table "Products" export: "/sample/goods.csv" with Goods.Good Goods.Price
products.csv has two columns
Price, while the file
goods.csv has the columns
Price. Let’s see how these two files can be consolidated into a single
Products table. This process is illustrated with:
read "/sample/products.csv" as Products with Product : text Price : number read "/sample/goods.csv" as Products with "Good" as Product : text show table "" a1c3 with Products.Product Products.Price
In the above script, the first
read statement is identical to the one we have seen in a previous section. However, the second
read statement, the one that focuses on
goods.csv is quite different. This second block is also associated with the
Products table, but only a single column is specified, and it’s precisely the column that needs to be renamed from
A single table can be produced from multiple
read statements. The first
read statement must specify all the columns, their types and all the relationships (i.e. primary and foreign dimensions) of the table. Later the
read statement should only list the divergences.
Also, it is possible to supplement one or several columns that happen to be missing from files. Let’s assume that the column
Price is missing from the file
products.csv while the column
Product is missing from the file
goods.csv. The following script addresses this situation:
read "/sample/products.csv" as Products with Product : text const(11.5) as Price read "/sample/goods.csv" as Products with const("hat") as Product show table "" a1c3 with Products.Product Products.Price
The special function
const() is used to replace the
Price column with a constant set at
11.5 and then it is used to replace the
Product column with the constant text literal
More generally, the function
const() can also be used with booleans (ex:
const(true)) and dates (ex:
const(2020, 8, 25) for August 25th, 2020). The data type returned by
const() is unambiguous and based on its calling argument, as a result, when
const() is used, there is no need to specify the read column’s data type anymore.
Multi-value column read
Within a flat text file, a single column may contain multiple values, typically separated by a separator that happens to be distinct from the primary separator of the file. Let’s consider the following script that produces a CSV file where the
Colors column implicitly contains a list of values.
table Products = with [| as Product, as Colors, as Price |] [| "shirt", "white;pink", 10.50 |] [| "pants", "black;blue;grey", 15.00 |] [| "hat", "brown", 5.25 |] show table "Products" export: "/sample/products.csv" with Products.Product Products.Colors Products.Price
This file can then be read with the following script that leverages the
read "/sample/products.csv" as Products with Product : text "Colors" split:";" as Color : text Price : number show table "Products" a1c3 with Products.Product Products.Color Products.Price
split option precedes the
as keyword in the definition of the read column. It is followed by a text literal that defines the expected separator. Also, for the sake of clarity, the column
Colors is renamed
Color (singular) because, due to the split, there is only one color left per line in every
Although renaming the column being split feels natural here, this is not a requirement for Envision. It is possible to preserve the original column name, as illustrated by the following script:
read "/sample/products.csv" as Products with Product : text split:";" as Colors : text Price : number show table "Products" a1c3 with Products.Product Products.Colors // single color due to split! Products.Price
As discussed above, the
split option is still preceding the
as keyword. However, while the above script is technically correct, it is a bit confusing because the vector
Products.Colors contains only single-color entries, precisely due to the initial split.
Only a single
split option can be specified per table. This limitation is enforced by Envision in order to steer clear from pathological behaviors, where a single line in the original text file could end-up being multiplied into an arbitrary large number of generated lines. Ex: if two values on the same line were to be split into 10 sub-values each, then this would result in 10x10=100 pairs.
Advanced remarks: Envision text values are limited to 256 characters. However, the
split read option has been designed in such a way that it is possible to reach a flat file that happens to exhibit cell values larger than 256 characters, as long as those cells are split and that the resulting sub-values are individually shorter than 256 characters.
Sanitizing input files
Existing flat file exports may come with all sorts of quirks. For example, missing values may get written in the files as
null, or some numbers may come with an unusual representation. The role of a sanitization process is to extract a sensible subset of the input data that isn’t entirely sane. When facing somewhat pathological situations, the recommended path in Envision consists of reading the columns as
text, and then applying the parsing logic within the script itself.
Let’s consider the file
products.csv generated by the following script:
table Products = with [| as Product, as Size |] [| "shirt", "4,5" |] [| "pants", "null" |] [| "hat", "3,2" |] show table "Products" export: "/sample/products.csv" with Products.Product Products.Size
Size column contains two oddities. First, there are
null values that cannot be parsed into numbers in a meaningful way. Second, the decimal separator is the comma (
,) instead of being the default dot (
The file can then be read with
Size as a raw text column as illustrated by the following script:
read "/sample/products.csv" as Products with Product : text "Size" as SizeRaw : text Products.IsSizeDefined, Products.Size = tryParseNumber(Products.SizeRaw, "", ",") show table "Products" a1c3 with Products.Product Products.SizeRaw Products.IsSizeDefined Products.Size
tryParseNumber takes three arguments: first, the text values to be tentatively parsed as numbers, second, the thousand’s separator, third, the decimal separator. It returns two arguments as a tuple. The first value of the tuple is a
boolean that indicates whether the text value has been successfully parsed as a number. The second value of the table is a
number, which gives the result of the parsing, but only when the parsing has been successful, and otherwise is the default number value (i.e. zero).
Envision also provides the function
parseNumber, which always expects the parsing operation to succeed (and fail otherwise) as well as the functions
tryParseDate, which are the counterparts to what we have seen in this section but for dates instead.