...

Read and write files

Envision processes input data consolidated as files within the Lokad platform. In turn, the results obtained through Envision can be put on display within dashboards, as we have done so far, but they can also be exported as files. Lokad supports diverse file formats such as Microsoft Excel or CSV (comma-separated values) with all its variants. The read-write capabilities are also intended to support internal data preparation pipelines within a Lokad account. When files are only intended for internal consumption within the Lokad account, a specific binary format, named Ionic, should be favored for performance.

Read and write overview

Let’s illustrate the read-write capabilities of Envision with two scripts. The first script generates a CSV file and the second script reads the file that has just been generated by the first script:

table Products = with
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "pants", "blue", 15.00 |]
  [| "hat", "red", 5.25 |]

write Products as "/sample/products.csv" with
  Product = Products.Product
  Color = Products.Color
  Price = if Products.Price > 0 then Products.Price else 100

In the script above, the table comprehension Products consolidates the table intended for the write operation. The write statement indicates the source table Products followed by the keyword as and the path of the file to be exported. The assignments introduced within the write block define the fields to be included in the written file. The field Color is defined with the vector Products.Color. The field Price is defined with an expression aligned with the Products table.

This file can, in turn, be read and displayed as illustrated by the second script:

read "/sample/products.csv" as Products max 1m with
  Product : text
  Color : text
  Price : number

show table "My Products" a1b3 with
  Products.Product
  Products.Color
  Products.Price

The above script starts with a read statement: a file path pattern immediately follows the keyword read, followed by the name of the table introduced by the keyword as. The fields expected to be found in the file are listed after the keyword with, one field per line, each specifying the name of the field followed by a colon : that introduces the field's data type.

After running the first script, it is possible to confirm the existence of the file /sample/products.csv by visiting the Files tab within the Lokad account. Open the sample folder and download the file named products.csv. Locally opening the file with a plain text editor reveals that its content is:

Product,Color,Price
shirt,"white,grey",10.5
pants,blue,15
hat,red,5.25

This file follows the usual CSV conventions: the first line is the header and the later lines are the data. The fields are separated by the comma symbol ,. In particular, the cell value white,grey, which contains a comma, is escaped through quotation marks, which, again, is the usual practice when formatting CSV files. There are plenty of subtleties related to parsing and formatting flat text files; those are discussed in greater detail in the following.

In a read statement, the ordering of the fields does not matter and does not have to match the ordering of the fields in the file itself. Also, the read statement is not required to list all the fields found in the file. Fields that are not used in the script can be omitted from the read statement.
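
For instance, the following sketch re-reads the products.csv file written above, listing the fields in a different order and omitting the Color field altogether:

read "/sample/products.csv" as Products max 1m with
  Price : number // listed before Product, unlike in the file itself
  Product : text // the Color field is simply omitted

show table "My Products" a1b3 with
  Products.Product
  Products.Price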

In its simplest form, the read statement takes a tabular file and loads it as a table within Envision. There are no particular restrictions. A single file can be absurdly large (e.g. 100GB) and/or absurdly long (e.g. 1 billion lines), although this is not a recommended practice. Smaller files are more manageable, and Envision provides capabilities to read data from multiple files, which will be reviewed in the following.

The file paths in Lokad follow the usual Linux conventions: the root folder is /, folders are delimited with /, paths are case-sensitive, and whitespaces (and Unicode characters) are allowed inside file or folder names.
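
For instance, the following read statement targets a hypothetical file whose path involves a nested folder and a whitespace within the file name:

// hypothetical path: nested folder, whitespace in the file name
read "/sample/archive 2024/Products.csv" as Archive max 1m with
  Product : text
  Price : number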

Flat text files support most simple datatypes such as text, number, date or boolean. More complex, and more Envision-specific, datatypes can also be stored in files, but they require specialized formats which are detailed in the following.
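
For instance, assuming a hypothetical file /sample/catalog.csv with suitable columns, a single read statement can mix those simple datatypes:

read "/sample/catalog.csv" as Catalog max 1m with
  Product : text
  LaunchDate : date
  Price : number
  Discontinued : boolean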

Roadmap: The write statement is now the preferred approach when the intent is to write a file precisely, rather than to display a table within the dashboard. In earlier versions of Envision, writing files could only be achieved through the write option specified on a show table block.

Read aliasing

A read alias is a field introduced multiple times under distinct names. This feature is intended to cope with edge cases that involve primary dimensions or table schemas (detailed later in this chapter).

read "/sample/products.csv" as Products[id] max 1m with
  Product : text
  Color : text
  Price : number
  id = Products.Product

show table "My Products" a1b3 with
  id
  Products.Color
  Products.Price

In the above script, an alias of the vector Products.Product is introduced under the name Products.id. The read statement turns this vector into the primary dimension of the table Products.

A read alias does not have to involve a primary dimension. However, if neither a primary dimension nor a table schema is involved, then a simple assignment after the read block produces the same effect. As a rule of thumb, we suggest restricting the use of read aliases to situations that cannot be addressed with a vector assignment.
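
For instance, here is a minimal sketch of the vector assignment alternative; unlike the read alias above, it leaves the table Products without a primary dimension:

read "/sample/products.csv" as Products max 1m with
  Product : text
  Color : text
  Price : number

Products.id = Products.Product // plain vector assignment after the read block

show table "My Products" a1b3 with
  Products.id
  Products.Color
  Products.Price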

Filesystem atomicity

The state of the filesystem, as perceived by an Envision script, is never modified during the execution of the script. This property eliminates an entire class of concurrency issues from Envision. However, this behavior differs notably from the procedural semantics observed in most regular programming languages.

All the read statements happen at the very beginning of the script execution, right after the propagation of the constants. The ordering of the read statements within the script does not matter (except as syntactic sugar to elide columns, more on this in the following). For instance, the following script works, even though the show block appears before the read block:

read "/sample/products.csv" as Products max 1m with
  Product : text
  Color : text
  Price : number

show table "My Products" a1b3 with // not very readable
  Products.Product
  Products.Color
  Products.Price

Conversely, all the write statements happen at the very end of the script execution. The files being written are not accessible - in their latest version - until the script completes. In particular, the ordering of the write statements has no impact at all on the files produced by the script. If two write statements target the same file, the script execution fails. The following script fails (assuming that the file /sample/products2.csv does not exist yet), and this failure illustrates the behavior: the read statement does not “perceive” the newly written file associated with the write block, as this file will only be added to the filesystem once the script completes.

read "/sample/products2.csv" as Products2 max 1m with
  Product : text
  Color : text
  Price : number

table Products = with
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "pants", "blue", 15.00 |]
  [| "hat", "red", 5.25 |]

write Products as "/sample/products2.csv" with
  Product = Products.Product
  Color = Products.Color
  Price = Products.Price

show table "My Products" a1b3 with
  Products2.Product
  Products2.Color
  Products2.Price

This behavior explains why the two scripts of the previous section have to be run in sequence; otherwise, the file read by the second script does not exist yet. By design, it is not possible, within a single execution of a single Envision script, to read a file written by a write statement found within that very same script.
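
Similarly, as a minimal sketch (with a hypothetical path), a script that contains two write statements targeting the same file fails:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "hat", 5.25 |]

write Products as "/sample/products3.csv" with
  Product = Products.Product
  Price = Products.Price

// second write targeting the same path: the script execution fails
write Products as "/sample/products3.csv" with
  Product = Products.Product
  Price = Products.Price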

Finally, if a script writes several files, then either the script succeeds and all the files get written in the filesystem, or the script fails and none of the files get written. Moreover, no other script can ever read the files while those files are being written. Changes brought to the filesystem by an Envision script are performed atomically. The Envision design eliminates race conditions between read and write operations.
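
For instance, in the sketch below (with hypothetical paths), either both files appear in the filesystem once the script completes, or - should the script fail - neither of them does:

table Products = with
  [| as Product, as Color, as Price |]
  [| "shirt", "white,grey", 10.50 |]
  [| "hat", "red", 5.25 |]

write Products as "/sample/products-a.csv" with
  Product = Products.Product
  Price = Products.Price

write Products as "/sample/products-b.csv" with
  Product = Products.Product
  Color = Products.Color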

A file-centric platform

Unlike the majority of enterprise apps, which are database-centric, the Lokad platform is file-centric. While the Lokad platform supports a series of built-in data connectors dedicated to third-party apps, also maintained by Lokad, those connectors simply gather data collected through APIs (application programming interfaces) into files suitable for Envision processing. Moreover, unless Lokad offers built-in support for a given third-party app, files are expected to be pushed by the client company through protocols like FTPS or SFTP.

There are profound reasons why Lokad adopted this file-centric design. The first reason is third-party capabilities, that is, not the capacity of Lokad itself to process large amounts of data, but the capacity of most third-party actors (or their apps) in the market to export large amounts of data toward Lokad. As a rule of thumb, with flat file exports, even badly designed systems tend to operate faster than 10MB/second, while with relational databases and web APIs, even soundly designed systems struggle to achieve 10MB/second. We casually observe (supposed) ERP market leaders struggling to sustain even 1 record/second across all their web API endpoints. The reasons for this state of affairs are beyond the scope of the present document, but suffice to say that designing scalable APIs is a difficult undertaking, and resources have usually not been invested on that front. Thus, by necessity, Lokad has to be able to process incoming flat files, as flat files have frequently been the only viable option available for data exports for our larger clients.

The second reason is correctness by design. The Lokad file system, as perceived by Envision, is immutable: for the entire duration of the script execution, all the files and all their content do not change. Without this property, any script reading a file being (over)written - either because the file is currently being uploaded through SFTP or because the file is being generated by another script - would read corrupted data. As soon as a company is dealing with more than 1GB of data, this problem becomes frequent. Thus, under the hood, the Lokad file system acts as a file versioning system, much like Git. When a script starts, its process gets a persistent affinity with a specific version of the file system, preventing any data change from surfacing in the Envision processing. However, while the script is running, files in Lokad can still be modified: the script does not lock the files. Whenever files are modified, a new version of the file system is created accordingly. Finally, the output files produced by an Envision script are committed atomically to the file system. This implies that another script cannot see “some” files already refreshed while the other files are not.

The third reason is its low operating costs. Files, ideally suitably compressed, are among the least expensive options to store and process large amounts of data, especially when the workload is highly intermittent. Indeed, most decisions optimized through Lokad (e.g. quantities purchased or produced, updated prices) don’t usually need to be refreshed more than once per day. Thus, it’s important to make sure that the cost of maintaining an inactive system is almost as low as the cost of raw BLOB storage (binary large objects).

The main drawback of storing data as (large) files is that, by design, low-latency fine-grained data accesses do not fit this perspective. As pointed out, Lokad is geared toward batch processing rather than online processing.

Advanced remark: The datatype is explicit for each field of the read statements. This design is not an accident: Envision guarantees that there are no type ambiguities in a script that could only be resolved by inspecting the underlying data in the files themselves.

Roadmap: The read statements in Envision mix - and couple - both file patterns (i.e. choosing which files are read) and file formats (i.e. how to read the files). We plan to revisit this design in the future to decouple these two concerns. This would not only clarify the language design, but also make it more expressive overall.
