Read and write files
Envision processes input data consolidated as files within the Lokad platform. In turn, the results obtained through Envision can be put on display within dashboards, as we have done so far, but also exported as files as well. Lokad supports diverse file formats such as Microsoft Excel or CSV (comma separated values) with all its variants. The read-write capabilities are also intended to support internal data preparation pipelines within a Lokad account. When files are only intended for internal consumption within the Lokad account, a specific binary format, named Ionic, should be favored for performance.
Table of contents
Read and write overview
Let’s illustrate the read-write capabilities of Envision with two scripts. The first script generates a CSV file and the second script reads the file that has just been generated by the first script:
table Products = with [| as Product, as Color, as Price |] [| "shirt", "white,grey", 10.50 |] [| "pants", "blue", 15.00 |] [| "hat", "red", 5.25 |] write Products as "/sample/products.csv" with Product = Products.Product Color = Products.Color Price = if Products.Price > 0 then Products.Price else 100
In the script above, the table comprehension
Products consolidates the table intended for the write operation. The
write statement indicates the source table
Products followed by the keyword
as and the path of the file to be exported. The assignments introduced within the
write block define the fields to be included in the written file. The field
Color is defined with the vector
Products.Color. The field
Price is defined with an expression aligned with the
This file can, in turn, be read and displayed as illustrated by the second script:
read "/sample/products.csv" as Products max 1m with Product : text Color : text Price : number show table "My Products" a1b3 with Products.Product Products.Color Products.Price
The above script starts with a
read statement that includes a pattern that immediately follows the keyword
read followed by the name of the table introduced by the keyword
as. The list of fields that are expected to be found in the file are listed after the
with, one field per line, specifying the name of the field followed by the semicolon
: introducing the field data type.
After running the first script, it is possible to confirm the existence of the file
/sample/products.csv by visiting the
Files tab within the Lokad account. Open the
sample folder and download the file named
products.csv. Locally opening the file with a plain text editor reveals that its content is:
Product,Color,Price shirt,"white,grey",10.5 pants,blue,15 hat,red,5.25
This file follows the usual CSV conventions, the first line being the header and the later lines being the data. The fields are separated by the comma symbol
,. In particular, the cell value
white,grey which contains a comma is escaped through quotation marks, which, again, is the usual practice to format CSV files. There are plenty of subtleties related to parsing and formatting flat text files, those are discussed in greater detail in the following.
read statement, the ordering of the fields does not matter and does not have to match the ordering of the fields in the file itself. Also, the
read statement is not required to list all the fields found in the file. Fields that are not used in the script can be omitted from the
In its simplest form, the
read statement takes a tabular file and loads it as a table within Envision. There are no particular restrictions. A single file can be absurdly large (e.g. 100GB) and/or absurdly long (e.g. 1 billion lines), although this is not a recommended practice. Smaller files are more manageable, and Envision provides capabilities to read data from multiple files, which will be reviewed in the following.
The file paths in Lokad follows the usual Linux conventions, most notably that the root folder is
/ and that folders are delimited with
/, that paths are case-sensitive and that whitespaces (and Unicode characters) are allowed inside a file or folder name.
Flat text files support most simple datatypes such as
boolean. More complex, and more Envision-specific datatypes, can also be stored in files, but require specialized formats which are detailed in the following.
write statement is now the preferred approach when the intent is to precisely write a file instead of displaying a table within the dashboard. In the earlier versions of Envision, writing files could only be achieved through the
write option to be specified on a
show table block.
A read alias is a field being introduced multiple times under distinct names. This feature is intended to cope with edge-cases that involve primary dimensions or table schemas (detailed later in this chapter).
read "/sample/products.csv" as Products[id] max 1m with Product : text Color : text Price : number id = Products.Product show table "My Products" a1b3 with id Products.Color Products.Price
In the above script, an alias of the vector
Products.Products is introduced under the name
read statement turns this vector into the primary dimension of the table
A read alias does not have to involve a primary dimension. However, if there is neither a primary dimension nor a table schema involved, then a simple assignement after the read block produces the same effect. As a rule of thumb, we suggest to restrict the use of read aliases to situations that cannot be addressed with a vector assignement.
The state of the filesystem, as perceived by an Envision script, is never modified during the execution of the script. This property eliminates an entire class of concurrency issues from Envision. However, this behavior differs notably from the procedural semantic which is observed with most regular programming languages.
read statements must happen at the very beginning of the script execution, right after the propagation of the constants. The ordering the
read statements does not matter (except as a syntactic sugar to ellide columns, more on this in the following).
read "/sample/products.csv" as Products max 1m with Product : text Color : text Price : number show table "My Products" a1b3 with // not very readable Products.Product Products.Color Products.Price
Conversely, all the
write statements happen at the very end of the script execution. The files being written are not accessible - in their latest version - until the script completes. In particular, the ordering of the
write statements has no impact at all on the files produced by the script. If two
write statements happen to be targeting the same file, the script execution will fail. The following script fails, and this failure illustrates this behavior. The
read statement does not “perceive” the newly written file associated with the
write block, as this file will only be added to the filesystem once the script completes.
read "/sample/products2.csv" as Products2 max 1m with Product : text Color : text Price : number table Products = with [| as Product, as Color, as Price |] [| "shirt", "white,grey", 10.50 |] [| "pants", "blue", 15.00 |] [| "hat", "red", 5.25 |] write Products as "/sample/products2.csv" with Product = Products.Product Color = Products.Color Price = Products.Price show table "My Products" a1b3 with Products2.Product Products2.Color Products2.Price
This behavior explains why those two scripts of the previous section have to be run in sequence, otherwise the file read by the second script does not exist yet. By design, it is not possible, within a single execution of a single Envision script, to read a file written by
write statement found within the very same script.
Finally, if a script writes several files, then, either the script succeeds and all the files get written in the filesystem; or, the script fails and none of the files get written. Moreover, no other script can ever read the files while those files are being written. Changes to the filesystem brought by an Envision are performed atomically. The Envision design eliminates race conditions between read and write operations.
A file-centric platform
Unlike the majority of enterprise apps that are database-centric, the Lokad platform is file-centric. While the Lokad platform supports a series of built-in data connectors, also maintained by Lokad, dedicated to third-party apps, those connectors are simply gathering data collected through APIs (application programming interfaces) into files suitable for Envision processing. Moreover, unless Lokad offers built-in support for a given third-party app, files are expected to be pushed by the client company through protocols like FTPS or SFTP.
There are profound reasons why Lokad adopted this file-centric design. The first reason is third-party capabilities, that is, not the capacity of Lokad itself to process large amounts of data, but the capacity of most third-party actors (or their apps) in the market to export large amounts of data toward Lokad. As a rule of thumb, with flat file exports even badly designed systems tend to operate faster than 10MB/second, while with relational databases and web APIs even soundly designed systems struggle to achieve 10MB/second. We casually observe (supposedly) ERP market leaders struggling to even sustain 1 record / second on all their web API endpoints. The reasons for this state of affairs are beyond the scope of the present document, but suffice to say that designing scalable APIs is a difficult undertaking, and resources have usually not been invested on that front. Thus, for Lokad to be able to process incoming flat files by necessity, as it has frequently been the case, it became the only viable option available for data exports for our larger clients.
The second reason is correctness by design. The Lokad file system as perceived by Envision is immutable: for the entire duration of the script process, all the files and all their content do not change. Without this property, any script reading a file being (over)written, - either because the file is currently being uploaded through SFTP or because the file is being generated by another script - would read corrupted data. As soon as a company is dealing with more than 1GB of data, this problem becomes frequent. Thus, under the hood, the Lokad file system acts as a file versioning system much like Git. When a script starts, its process gets a persistent affinity with a specific version of the file system, preventing any data change to surface into the Envision processing. However, while the script is running, files in Lokad can still be modified, the script does not lock the files. Whenever files are modified, a new version of the file system is created accordingly. Finally, the output files produced by an Envision script are committed atomically to the file system. This implies that another script cannot see “some” files already refreshed while the other files are not.
The third reason is its low operating costs. Files, ideally suitably compressed, are among the least expensive options to store and process large amounts of data, especially when the workload is highly intermittent. Indeed, most decisions optimized through Lokad (e.g. quantities purchased or produced, updated prices) don’t usually need to be refreshed more than once per day. Thus, it’s important to make sure that the cost of maintaining an inactive system is almost as low as the cost of raw BLOB storage (binary large objects).
The main drawback of storing data as (large) files is that, by design, low latency fine-grained data accesses don’t really fit in this perspective. As pointed out, Lokad is geared toward batch processing rather than online processing.
Advanced remark: The datatype is explicit for each field of the
read statements. This design is not an accident, Envision guarantees that there are no type ambiguities in a script that could only be resolved by inspecting the underlying data in the files themselves.
read statements in Envision are mixing - and coupling - both file patterns (i.e. choosing which files are read) and file formats (i.e. how to read the files). We plan to revisit this design in the future to decouple these two concerns. This would not only clarify the language design, but also make it more expressive overall.