Read and write files
Envision processes input data consolidated as files within the Lokad platform. In turn, the results obtained through Envision can be put on display within dashboards, as we have done so far, but also exported as files as well. Lokad supports diverse file formats such as Microsoft Excel or CSV (comma separated values) with all its variants. The read-write capabilities are also intended to support internal data preparation pipelines within a Lokad account. When files are only intended for internal consumption within the Lokad account, a specific binary format, named Ionic, should be favored for performance.
Table of contents
Read and write overview
Let’s illustrate the read-write capabilities of Envision with two scripts. The first script generates a CSV file and the second script reads the file that has just been generated by the first script:
table Products = with
[| as Product, as Color, as Price |]
[| "shirt", "white,grey", 10.50 |]
[| "pants", "blue", 15.00 |]
[| "hat", "red", 5.25 |]
write Products as "/sample/products.csv" with
Product = Products.Product
Color = Products.Color
Price = if Products.Price > 0 then Products.Price else 100
In the script above, the table comprehension Products
consolidates the table intended for the write operation. The write
statement indicates the source table Products
followed by the keyword as
and the path of the file to be exported. The assignments introduced within the write
block define the fields to be included in the written file. The field Color
is defined with the vector Products.Color
. The field Price
is defined with an expression aligned with the Products
table.
This file can, in turn, be read and displayed as illustrated by the second script:
read "/sample/products.csv" as Products max 1m with
Product : text
Color : text
Price : number
show table "My Products" a1b3 with
Products.Product
Products.Color
Products.Price
The above script starts with a read
statement that includes a pattern that immediately follows the keyword read
followed by the name of the table introduced by the keyword as
. The list of fields that are expected to be found in the file are listed after the with
, one field per line, specifying the name of the field followed by the semicolon :
introducing the field data type.
After running the first script, it is possible to confirm the existence of the file /sample/products.csv
by visiting the Files
tab within the Lokad account. Open the sample
folder and download the file named products.csv
. Locally opening the file with a plain text editor reveals that its content is:
Product,Color,Price
shirt,"white,grey",10.5
pants,blue,15
hat,red,5.25
This file follows the usual CSV conventions, the first line being the header and the later lines being the data. The fields are separated by the comma symbol ,
. In particular, the cell value white,grey
which contains a comma is escaped through quotation marks, which, again, is the usual practice to format CSV files. There are plenty of subtleties related to parsing and formatting flat text files, those are discussed in greater detail in the following.
In a read
statement, the ordering of the fields does not matter and does not have to match the ordering of the fields in the file itself. Also, the read
statement is not required to list all the fields found in the file. Fields that are not used in the script can be omitted from the read
statement.
In its simplest form, the read
statement takes a tabular file and loads it as a table within Envision. There are no particular restrictions. A single file can be absurdly large (e.g. 100GB) and/or absurdly long (e.g. 1 billion lines), although this is not a recommended practice. Smaller files are more manageable, and Envision provides capabilities to read data from multiple files, which will be reviewed in the following.
The file paths in Lokad follows the usual Linux conventions, most notably that the root folder is /
and that folders are delimited with /
, that paths are case-sensitive and that whitespaces (and Unicode characters) are allowed inside a file or folder name.
Flat text files support most simple datatypes such as text
, number
, date
or boolean
. More complex, and more Envision-specific datatypes, can also be stored in files, but require specialized formats which are detailed in the following.
Roadmap: The write
statement is now the preferred approach when the intent is to precisely write a file instead of displaying a table within the dashboard. In the earlier versions of Envision, writing files could only be achieved through the write
option to be specified on a show table
block.
Read aliasing
A read alias is a field being introduced multiple times under distinct names. This feature is intended to cope with edge-cases that involve primary dimensions or table schemas (detailed later in this chapter).
read "/sample/products.csv" as Products[id] max 1m with
Product : text
Color : text
Price : number
id = Products.Product
show table "My Products" a1b3 with
id
Products.Color
Products.Price
In the above script, an alias of the vector Products.Products
is introduced under the name Products.id
. The read
statement turns this vector into the primary dimension of the table Products
.
A read alias does not have to involve a primary dimension. However, if there is neither a primary dimension nor a table schema involved, then a simple assignement after the read block produces the same effect. As a rule of thumb, we suggest to restrict the use of read aliases to situations that cannot be addressed with a vector assignement.
Filesystem atomicity
The state of the filesystem, as perceived by an Envision script, is never modified during the execution of the script. This property eliminates an entire class of concurrency issues from Envision. However, this behavior differs notably from the procedural semantic which is observed with most regular programming languages.
All the read
statements must happen at the very beginning of the script execution, right after the propagation of the constants. The ordering the read
statements does not matter (except as a syntactic sugar to ellide columns, more on this in the following).
read "/sample/products.csv" as Products max 1m with
Product : text
Color : text
Price : number
show table "My Products" a1b3 with // not very readable
Products.Product
Products.Color
Products.Price
Conversely, all the write
statements happen at the very end of the script execution. The files being written are not accessible - in their latest version - until the script completes. In particular, the ordering of the write
statements has no impact at all on the files produced by the script. If two write
statements happen to be targeting the same file, the script execution will fail. The following script fails, and this failure illustrates this behavior. The read
statement does not “perceive” the newly written file associated with the write
block, as this file will only be added to the filesystem once the script completes.
read "/sample/products2.csv" as Products2 max 1m with
Product : text
Color : text
Price : number
table Products = with
[| as Product, as Color, as Price |]
[| "shirt", "white,grey", 10.50 |]
[| "pants", "blue", 15.00 |]
[| "hat", "red", 5.25 |]
write Products as "/sample/products2.csv" with
Product = Products.Product
Color = Products.Color
Price = Products.Price
show table "My Products" a1b3 with
Products2.Product
Products2.Color
Products2.Price
This behavior explains why those two scripts of the previous section have to be run in sequence, otherwise the file read by the second script does not exist yet. By design, it is not possible, within a single execution of a single Envision script, to read a file written by write
statement found within the very same script.
Finally, if a script writes several files, then, either the script succeeds and all the files get written in the filesystem; or, the script fails and none of the files get written. Moreover, no other script can ever read the files while those files are being written. Changes to the filesystem brought by an Envision are performed atomically. The Envision design eliminates race conditions between read and write operations.
A file-centric platform
Unlike the majority of enterprise apps that are database-centric, the Lokad platform is file-centric. While the Lokad platform supports a series of built-in data connectors, also maintained by Lokad, dedicated to third-party apps, those connectors are simply gathering data collected through APIs (application programming interfaces) into files suitable for Envision processing. Moreover, unless Lokad offers built-in support for a given third-party app, files are expected to be pushed by the client company through protocols like FTPS or SFTP.
There are profound reasons why Lokad adopted this file-centric design. The first reason is third-party capabilities, that is, not the capacity of Lokad itself to process large amounts of data, but the capacity of most third-party actors (or their apps) in the market to export large amounts of data toward Lokad. As a rule of thumb, with flat file exports even badly designed systems tend to operate faster than 10MB/second, while with relational databases and web APIs even soundly designed systems struggle to achieve 10MB/second. We casually observe (supposedly) ERP market leaders struggling to even sustain 1 record / second on all their web API endpoints. The reasons for this state of affairs are beyond the scope of the present document, but suffice to say that designing scalable APIs is a difficult undertaking, and resources have usually not been invested on that front. Thus, for Lokad to be able to process incoming flat files by necessity, as it has frequently been the case, it became the only viable option available for data exports for our larger clients.
The second reason is correctness by design. The Lokad file system as perceived by Envision is immutable: for the entire duration of the script process, all the files and all their content do not change. Without this property, any script reading a file being (over)written, - either because the file is currently being uploaded through SFTP or because the file is being generated by another script - would read corrupted data. As soon as a company is dealing with more than 1GB of data, this problem becomes frequent. Thus, under the hood, the Lokad file system acts as a file versioning system much like Git. When a script starts, its process gets a persistent affinity with a specific version of the file system, preventing any data change to surface into the Envision processing. However, while the script is running, files in Lokad can still be modified, the script does not lock the files. Whenever files are modified, a new version of the file system is created accordingly. Finally, the output files produced by an Envision script are committed atomically to the file system. This implies that another script cannot see “some” files already refreshed while the other files are not.
The third reason is its low operating costs. Files, ideally suitably compressed, are among the least expensive options to store and process large amounts of data, especially when the workload is highly intermittent. Indeed, most decisions optimized through Lokad (e.g. quantities purchased or produced, updated prices) don’t usually need to be refreshed more than once per day. Thus, it’s important to make sure that the cost of maintaining an inactive system is almost as low as the cost of raw BLOB storage (binary large objects).
The main drawback of storing data as (large) files is that, by design, low latency fine-grained data accesses don’t really fit in this perspective. As pointed out, Lokad is geared toward batch processing rather than online processing.
Advanced remark: The datatype is explicit for each field of the read
statements. This design is not an accident, Envision guarantees that there are no type ambiguities in a script that could only be resolved by inspecting the underlying data in the files themselves.
Roadmap: the read
statements in Envision are mixing - and coupling - both file patterns (i.e. choosing which files are read) and file formats (i.e. how to read the files). We plan to revisit this design in the future to decouple these two concerns. This would not only clarify the language design, but also make it more expressive overall.