Read and write files

Envision processes input data consolidated as files within the Lokad platform. In turn, the results obtained through Envision can be put on display within dashboards, as we have done so far, but also exported as files as well. Lokad supports diverse file formats such as Microsoft Excel or CSV (comma separated values) with all its variants.

Let’s illustrate this idea with two scripts. The first script generates a CSV file and the second script reads the file that has just been generated by the first script:

table Products = with
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "pants", "blue", 15.00 |]
  [| "hat", "red", 5.25 |]

show table "My Export" export: "/sample/products.csv" with
  Products.Product
  Products.Color
  Products.Price

In the script above, the table comprehension Products consolidates the table intended for the export. The tile table is marked with the option export followed by the path of the file to be exported. This file can, in turn, be read and displayed as illustrated by the second script:

read "/sample/products.csv" as Products with
  Product : text
  Color : text
  Price : number

show table "My Display" a1b3 with
  Products.Product
  Products.Color
  Products.Price

The above script starts with a read statement that includes a pattern that immediately follows the keyword read followed by the name of the table introduced by the keyword as. The list of columns that are expected to be found in the file are listed after the with, one column per line, specifying the name of the column followed by the semicolon : introducing the column data type.

In order to work, those two scripts need to be run in sequence, otherwise the file read by the second script does not exist yet. After running the first script, it is possible to confirm the existence of the file /sample/products.tsv by visiting the Files tab within the Lokad account. Open the sample folder and download the file named products.csv. Locally opening the file with a plain text editor reveals that its content is:

Product,Color,Price
shirt,"white,grey",10.5
pants,blue,15
hat,red,5.25

This file follows the usual CSV conventions, the first line being the header and the later lines being the data. The columns are separated by the comma symbol ,. In particular, the cell value white,grey which contains a comma is escaped through quotation marks, which, again, is the usual practice to format CSV files. There are plenty of subtleties related to parsing and formatting flat text files, those are discussed in greater detail in the following.

In a read statement, the ordering of the columns does not matter and does not have to match the ordering of the columns in the file itself. Also, the read statement is not required to list all the columns found in the file. Columns that are not used in the script can be omitted from the read statement.

In its simplest form, the read statement takes a tabular file and loads it as a table within Envision. There are no particular restrictions. A single file can be absurdly large (e.g. 100GB) and/or absurdly long (e.g. 1 billion lines), although this is not a recommended practice. Smaller files are more manageable, and Envision provides capabilities to read data from multiple files, which will be reviewed in the following.

The file paths in Lokad follows the usual Linux conventions, most notably that the the root folder is / and that folders are delimited with /, that paths are case-sensitive and that whitespaces (and Unicode characters) are allowed inside a file or folder name.

Flat text files support most simple datatypes such as text, number, date or boolean. More complex, and more Envision-specific datatypes, can also be stored in files, but require specialized formats which are detailed in the following.

Unlike the majority of enterprise apps that are database-centric, the Lokad platform is file-centric. While the Lokad platform supports a series of built-in data connectors, also maintained by Lokad, dedicated to third-party apps, those connectors are simply gathering data collected through APIs (application programming interfaces) into files suitable for Envision processing. Moreover, unless Lokad offers built-in support for a given third-party app, files are expected to be pushed by the client company through protocols like FTPS or SFTP.

There are profound reasons why Lokad adopted this file-centric design. The first reason is third-party capabilities, that is, not the capacity of Lokad itself to process large amounts of data, but the capacity of most third-party actors (or their apps) in the market to export large amounts of data toward Lokad. As a rule of thumb, with flat file exports even badly designed systems tend to operate faster than 10MB/second, while with relational databases and web APIs even soundly designed systems struggle to achieve 10MB/second. We casually observe (supposedly) ERP market leaders struggling to even sustain 1 record / second on all their web API endpoints. The reasons for this state of affairs are beyond the scope of the present document, but suffice to say that designing scalable APIs is a difficult undertaking, and resources have usually not been invested on that front. Thus, for Lokad to be able to process incoming flat files by necessity, as it has frequently been the case, it became the only viable option available for data exports for our larger clients.

The second reason is correctness by design. The Lokad file system as perceived by Envision is immutable: for the entire duration of the script process, all the files and all their content do not change. Without this property, any script reading a file being (over)written, - either because the file is currently being uploaded through SFTP or because the file is being generated by another script - would read corrupted data. As soon as a company is dealing with more than 1GB of data, this problem becomes frequent. Thus, under the hood, the Lokad file system acts as a file versioning system much like Git. When a script starts, its process gets a persistent affinity with a specific version of the file system, preventing any data change to surface into the Envision processing. However, while the script is running, files in Lokad can still be modified, the script does not lock the files. Whenever files are modified, a new version of the file system is created accordingly. Finally, the output files produced by an Envision script are committed atomically to the file system. This implies that another script cannot see “some” files already refreshed while the other files are not.

The third reason is its low operating costs. Files, ideally suitably compressed, are among the least expensive options to store and process large amounts of data, especially when the workload is highly intermittent. Indeed, most decisions optimized through Lokad (e.g. quantities purchased or produced, updated prices) don’t usually need to be refreshed more than once per day. Thus, it’s important to make sure that the cost of maintaining an inactive system is almost as low as the cost of raw BLOB storage (binary large objects).

The main drawback of storing data as (large) files is that, by design, low latency fine-grained data accesses don’t really fit in this perspective. As pointed out, Lokad is geared toward batch processing rather than online processing.

Advanced remark: The datatype is explicit for each column of the read statements. This design is not an accident, Envision guarantees that there are no type ambiguities in a script that could only be resolved by inspecting the underlying data in the files themselves.

Roadmap: the read statements in Envision are mixing - and coupling - both file patterns (i.e. choosing which files are read) and file formats (i.e. how to read the files). We plan to revisit this design in the future to decouple these two concerns. This would not only clarify the language design, but also make it more expressive overall.